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BIOLOGICAL DISCOVERY USING GENE REGULATORY 
NETWORKS GENERATED FROM MULTIPLE-DISRUPTION 
EXPRESSION LIBRARIES 

5 Related Applications 

This application claims priority under 35 U.S.C. Section 119(e) to U.S. 
provisional applications serial numbers 60/325,016, filed September 26, 2001; 
60/334,372, filed November 29, 2001; 60/334,255, filed November 29, 2001; 
60/334,230, filed November 29, 2001; 60/370,824, filed April 8, 2002; and 
10 60/397,458 filed July 19, 2002. Each of the above provisional patent 

applications is herein incorporated folly by reference. 

Field of the Invention 

Embodiments of this invention relate to methods for elucidating network 
relationships between genes. Methods include Boolean logic, Bayesian 
15 estimation, maximum likelihood analysis, spline functions and other curve- 

fitting methods, and methods to determine edge relationships between groups of 
genes that are functionally related to each other. 

BACKGROUND 

20 Gene regulatory networks elucidated from strategic, genome-wide 

experimental data can aid in the discovery of novel gene function information 
and expression regulation events from observation of transcriptional regulation 
events among genes of known and unknown biological function. 

Methods designed to elucidate gene regulation pathways have been 

25 reported previously 0 ,2,3) . The inferred networks reported in these studies were 

derived from gene expression data sets derived from time course, cell cycle and 
environmental perturbation (4j5) . However, control relationships inferred from 
such data sets are suspect since they are not based on comprehensive 
experimental data designed to elucidate transcription-related regulatory control 
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functions. To rigorously and precisely identify novel and complex gene 
regulatory networks from de novo expression data sets, a systematic and 
integrated strategy of expression experiments on genomic deletion mutants 
combined with suitable computational methods is necessary (6 " 9) . 
5 Of particular importance in the creation of inferred regulatory networks 

is the biological relevance of the expression experiments from which the 
putative control relationships are derived. Genome-wide expression data 
generated from competitive hybridization disruption experiments offer the 
advantages of internal control and quantitative measurement of the direct effects 
10 of a gene's presence or absence on the expression of other genes. Selection of 

disruptant experiments to maximize the elucidation of control relationships is 
valuable for generating useful gene regulatory information. 

SUMMARY 

The present invention provides methods useful for establishing 

1 5 interrelationships among genes or groups of genes in a system, e.g., establishing 

gene pathways or gene networks in a system. In one embodiment, gene 
networks or gene pathways can be constructed based on analyzing gene 
disruption expression profiles using improved Boolean methods, improved 
Bayesian methods, or a combination of Boolean and Bayesian methods, as 

20 provided by the present invention. In another embodiment, gene networks or 

gene pathways can be constructed based on analyzing expression profiles of 
genes affected by an agent, e.g., a drug. In yet another embodiment, gene 
networks or gene pathways affected by an agent can be constructed by 
analyzing the gene networks or pathways obtained from the gene disruption 

25 expression profiles and the agent affected expression profiles. 

In general, gene disruption expression profiles can be obtained based 
on expression profiles of a library of genes, each of which disrupted 
individually or in combination with others, e.g., other genes in a related function 
to provide an expression profile. For example, a library of genes can be 

30 selected (e.g., an entire genome or a series of genes selected based on other 
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criteria, including polymerase chain reaction (PGR) primers). Regardless of 
how the library of genes is selected, once selected, each gene of the library can 
be disrupted individually and/or in combination with others resulting in a 
collection of libraries, each of which containing at least one disrupted gene 
5 along with other non-disrupted genes. Thus, if one selects 100 genes, the 

resulting library will consist of at least 101 different sub-libraries e.g., one sub- 
library of non-disrupted genes or wild-type and at least one disruptant sub- 
library for each of the 100 genes. Thus, libraries of disrupted genes can be 
created and expression profiles for each sub-library and the entire library can be 

1 0 obtained. 

Agent affected expression profiles can be obtained by administering 
one or more desired agents to a system containing selected genes and collecting 
expression profiles of the genes at different dosage or time point of agent 
administration. In some cases the agent has no effect on gene expression, in 

15 other cases the agent is inhibitory (e.g., decreases gene expression) and in other 

cases the agent increases gene expression (e.g., is an inducer). 

Gene expression profiles can be obtained in a quantified form using 
any suitable means, e.g., using microarray. In one embodiment, a gene 
expression profile can be organized into gene expression matrix, e.g., a binary 

20 matrix categorizing the genes into affected and not affected. In another 

embodiment, one can introduce an "equivalence set" in which data is 
normalized, thereby revealing the quantitative relationships in gene expression. 
From an equivalence set, network information relating the genes with each other 
can be created. Then, one can use networks to determine functional 

25 relationships between genes. One can then use the deduced functional 

relationships between genes to predict drug and or biological effects. Then, 
those predictions can be experimentally tested using, for example, mircoarray 
experiments. This process can result in what we call a "final common pathway" 
of gene expression, that is, alteration in the function of one gene results in 

30 effects on genes "downstream" from the altered gene. 
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According to one embodiment of the present invention, one can 
analyzing "downstream" effect from a directly affected gene by using the 
improved Boolean methods of the present invention. According to another 
embodiment of the present invention, one can go "upstream" from an affected 
5 gene by applying an improved Bayesian method of the present invention. The 

present invention provides a new approach to Bayesian gene network analysis, 
in which non-linear, and/or non-parametric regression models are used. Using 
this approach means that one need not make any apriori assumptions about 
causal relationships, but rather can infer causal relationships, thereby providing 

10 "upstream" or "initial pathways" by which a gene observed to be affected by a 

drug or treatment is induced to alter its expression by other genes in an 
upstream relationship to the affected, observed gene. The present invention 
provides an improved criterion, herein termed "BNRC" for estimating a 
Bayesian graph of gene networks. Thus, the gene relationships are selected that 

1 5 minimize the BNRC criterion. 

In other embodiments, other new methods are provided that not 
dependent upon Gaussian or other assumed variance in the data. Rather, in 
certain embodiments, we can measure the variance of the data and use the 
observed variance to affect the BNRC criterion. Using such non-parametric 

20 regression with heterogeneous error variances and interactions, one can 

optimize the curve fitting of data obtained, and predict how many experiments 
are needed to obtain data having a desired variance. Using methods such as 
these, one can obtain gene network information that has a desired degree of 
accuracy and reliability, and provides both upstream and downstream effects. In 

25 some embodiments, the new Bayesian model and penalized maximum 

likelihood estimation (PMLE) yield similar results. 

In still other embodiments, relationships between two genes can be 
elucidated by analysis of a graph of expression of one gene compared to the 
expression of another gene. Such graphs may be linear or non-linear. To 

30 characterize the relationships, in some embodiments, linear splines, B-splines, 
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Fourier transforms, wavelet transforms, or other basis functions can be used. In 
some cases, B-splines are conveniently employed. 

Determining whether a particular gene or series of genes is important 
in regulating other genes can be very useful. However, in certain cases, outliers 
5 in the data can complicate interpretation of results. This problem can be 

especially troubling where outliers are near boundaries of gene groups in the 
overall network. Thus, in certain embodiments, boundary effects can be 
elucidated, in which the edge intensity and the degree of confidence of the 
direction of Bayes causality can be determined. 
10 In other embodiments, time-course of gene expression can be 

determined using linear splines. Using splines, time-ordered data can be more 
reliably analyzed that prior art, "fold-change" methods, which can be relatively 
insensitive. 

Using one or more of the above-described general methods, the present 
15 invention provides gene network information that is useful and validated by 

certain results already known for the yeast genome. However, the methods are 
widely applicable to any genetic network (e.g., "transcriptome"), and to 
interactions at the level of proteins ("proteome"). Additionally, using one or 
more of the new methods, we have determined novel relationships between 
20 functional genes relating to antifungal therapies. Thus, using methods of this 

invention, one can predict putative therapeutic targets based on an 
understanding of the initial and final common pathways of gene expression. 

BRIEF DESCRIPTION OF THE DRAWINGS 
25 This invention is described with respect to specific embodiments 

thereof, in which: 

Figure 1 depicts a schematic diagram of model-based interactive 
biological discovery of this invention. 
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Figure 2 depicts construction of a gene regulatory sub-network model, 
using a Boolean method of this invention. Figure 2 A depicts a Gene Expression 
Matrix, in which numerous profiles are integrated. Each element of the matrix 
represents a ratio of gene expression between a gene in a column and a gene in a 
5 row. Figure 2 B depicts binary relationships between genes to produce a Binary 

Matrix. If gene C G1' is deleted and the intensity of gene C G2 ? is 
significantly altered as a result, then gene 'Gl ' affects gene 'G2' . Figure 
2 C depicts an adjacency matrix of identified looped regulation genes. If gene 
£ G3' and gene 'G4' affect each other mutually, they form a loop 

1 0 (strongly connected component) regulation. Figure 2D depicts a step in which 

genes are partitioned into an Equivalence Set, which treats a loop logically as a 
"virtual gene". An equivalence set is a group of genes that affect each other or 
as a group affect one discrete gene. Figure 2E depicts a skeleton relation 
between virtual genes. The shortest path relationships should be selected to 

15 build hierarchical connections. Figure 2F depicts a regulation pathway formed 

from the skeleton matrix. 

Figure 3 depicts results of a disruptant-based study and analysis of this 
invention in yeast in which 552 nodes representing the included genes and 2953 
putative regulatory links among these genes. 

20 Figure 4 depicts a transcription factor regulatory network model of this 

invention classified by cellular functional roles (CFR) in yeast. In this model, 
there were 98 transcription factors grouped by cellular functional roles 
according to information provided in the Yeast Proteome Database. Genes 
inside the circles are grouped into a given cellular functional category. 

25 Regulatory control relationships are depicted with colored lines. Colors 

indicate the category of genes from which the control relationships emanate. 
Blue: Carbohydrate metabolism, Bluish purple: Chromatin/Cbromosome 
structure, Brown: Energy generation, Dark Green: Other metabolism, Gray: 
DNA repair, Green: Lipid, fatty acid metabolism, Light Green: Amino acid 
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metabolism, Orange: Cell stress, Red: Meiosis/Mating response, Pink: 
Differentiation, Purple: Cell cycle. The bold red lines indicate regulatory control 
of transcription factors related to cell cycle originating from genes in the 
meiosis/mating response group. The bold blue lines indicate control 
5 relationships emanating from the carbohydrate metabolism group exerting 

influence over genes in the lipid fatty-acid metabolism group. Several control 
lines emanating from "carbohydrate metabolism " are depicted. In sharp 
contrast, only 2 individual genes, SKN7 and HMS2 independently influence cell 
cycle related gene expression. 

1 0 Figure 5 depicts a detailed view of a gene regulatory sub-network of 

transcription factors reconstructed from Boolean analysis of gene expression 
experiments on disruptant mutant yeast strains. Black lines indicate a 
regulatory relationship, with the arrows showing the direction of expression 
influence. The colors and shapes of the nodes denote the general categories of 

1 5 cellular function of the gene product according to their descriptions in the YPD. 

Genes related to cell division mechanisms are indicated with triangular nodes 
and genes related to DNA repair and chromosome structure are depicted with 
squares. The elucidated network shows novel topological control relationships 
among genes related to meiosis, the mating response and DNA structure and 

20 repair mechanisms via UME6 and MET28. Genes related to meiosis and mating 

response genes are downstream of a cascade regulated by IN02 and a further 
sub-grouping of genes related to DNA repair and structure appears 
hierarchically downstream of MET28. 

Figure 6 depicts a flow chart showing methods of this invention for the 
25 estimation of network relationships between genes using Bayesian inference 

and minimization of a BNRC criterion. 

Figures 7a and 7b depict conventional methods for analyzing drug 
response expression data. Figure 7a includes list of Yeast genes whose 
expression levels are significantly affected by Griseofulvin exposure at lOOmg 
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at 1 minute (one of 20 experiments). Figure 7b depicts hierarchical clustering of 
expression data from drug response and gene disruption experiments. 

DETAILED DESCRIPTION 
The descriptions that follow include specific examples drawn from 
studies in the yeast, Sarchomyces. The methods for analyzing relationships 
between yeast genes are equally applicable to analysis of relationships among 
genes of different species, including eukaryotes, prokaryotes, and viruses. 
Thus, the descriptions and examples that follow are illustrative and are not 
intended to limit the scope of the invention. 



I- Biological Discovery with Gene Regulatory Networks in Yeast 

Generated from Multiple-Disruption Full Genome Expression Libraries 

Gene regulatory networks elucidated from strategic, genome-wide 
experimental data can aid in the discovery of novel gene function information 
and expression regulation events form observation of transcriptional regulation 
events among genes of known and unknown biological function. Of particular 
importance in the creation of inferred regulatory networks is the biological 
relevance of the expression experiments from which the putative control 
relationships are derived. Genome-wide expression data generated from 
competitive hybridization disruption experiments offer the advantages of 
internal control and quantitative measurement of the direct effects of a gene's 
presence or absence on the expression of other genes. Selection of disruptant 
experiments to maximize the elucidation of control relationships is important 
for generating useful gene regulatory information. To create a reliable and 
comprehensive data set for the elucidation of transcription regulation on 120 
genes with known involvement in transcription regulation. We report several 
novel regulatory relationships between known transcription factors and other 
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genes with previously unknown biological function discovered with this 
expression library. 

We have implemented, for the yeast genome, a systematic, iterative 
approach that combines full-genome biological expression experiments with 
5 gene regulatory inference, as shown in Figure 1. In certain embodiments, we 

start with the creation of a library of hundreds of full genome expression 
experiments, with one gene's expression disrupted per experiment. From this 
data we used computational techniques to infer approximations of gene 
expression regulatory relationships. We then examined the biological relevance 

10 of regulatory relationships with computerized visualization and simulation 

software and validated our findings on novel or biologically interesting sub- 
networks through other databases as well as through further experimentation, 
including combinatorial disruption experiments. 

We constructed a gene expression data library using full-genome yeast 

1 5 c-DNA microarrays. The library was comprised of expression experiments on 

120 yeast strains, each with one gene disrupted by homologous recombination. 
Each of the genes in this library was selected for experimentation because it was 
reported in the Yeast Proteome Database (YPD) to be a factor involved in 
transcription regulation. Previously reported gene regulatory networks show 

20 that genes can interact with themselves and with other regulatory genes (10 \ To 

reconstruct hierarchical regulatory relationships from the expression library, in 
certain embodiments, we developed a novel Boolean algorithm that 
accommodates common looped regulatory relationships. As shown in Figure 2, 
gene regulatory relationships modeled by this method can be represented as a 

25 directed graph of up-regulation or down-regulation of gene expression between 

2 given genes of the 5871 genes measured in the each expression experiments. 
We constructed a 552 genes member model of regulatory control relationships 
and then further constrained a sub-network model composed of 98 well known 
transcription factors. The resultant model, shown in Figure 3, contains a total of 
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552 nodes representing the included genes and 2953 putative regulatory links 
among these genes. 

In certain embodiments, we classified transcription factors in the 
network model according to cellular functional roles (CFR) as defined in YPD. 
5 Figure 4 shows the control relationships among classified transcription factors 

in the network. Only £K/V7and HMS2 directly influenced expression of cell 
cycle related genes. We identified several control lines emanating from 
"carbohydrate metabolism" genes to all other functional gene groups. This 
finding is consistent with the energy dependent nature of many cellular 

1 0 processes and metabolic pathways. 

As shown in Figure 4, a distinct feature is that expression levels of lipid 
fatty-acid metabolism transcription factors were exclusively under control of 
carbohydrate metabolism transcription factors. This relation was given an 
account of interactions among proteins involved in phospholipid synthesis 

1 5 pathway with the glucose response pathway, the lipid signaling pathway and 

other lipid synthesis pathways have been reported (n) . 

Our new methods were employed to further explore detailed 
relationships between expression regulatory genes, such as transcription factors 
with regulatory and non-regulatory genes, from all of gene expression 

20 experimental data. One can characterize regulatory roles of genes with 

unreported biological function by virtue of their expression control by and/or 
over genes with known function. Figure 5 shows an example of newly 
elucidated control relationships among transcription factors involved in cell 
division regulation and DNA replication/repair regulation. We found two 

25 discrete functional branches in the sub-network that correspond to cell division 

regulation and DNA replication/repair are linked by UME6 and MET28, 
indicating the important role of these two transcription factors in coordinating 
expression regulation of these interdependent regulatory pathways. MET28, as 
its name suggests, was previously characterized as a transcription factor related 

30 to methionine metabolism (l2 \ The novel putative role for Met28p in regulation 
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of chromosome segregation is supported by its reported interaction with known 
chromosomal segregation component Smclp, as part of a larger nexus of 
chromosomal segregation proteins, in mating-type two-hybrid assays (13) . 

Through sequence analysis of coding sequences and upstream regions 
5 of genes in the above-mentioned sub network, we validated the sequence level 

control mechanisms between transcription factors and their target genes and 
DNA binding sequences. In the case of UME6, which is known as a global 
transcriptional regulation of many meiotic genes (14 ' 15) and its control system, 
we performed Multiple Expectation-Maximization for motif Elicitation 

10 (MEME) analysis of regions upstream of the 34 genes controlled by UME6 in 

our model (16) . We found two consensus sequences, TAGCCGCCGA (SEQ ID 
NO: 1) and TGGGCGGCTA (SEQ ID NO: 2) that were present in 14.7% and 
32.4% of the 34 genes respectively, and that had significant P values according 
to the MEME search. According to the TRANSFAC database, 

15 TAGCCGCCGA is defined as the binding site of Ume6p (14) and 

TCGGCGGCA is reported to be the binding site of a repressor of CAR1 (17) 
which were repressed by a three components complex containing Ume6p 
(Ume6p, Sin3p and Rpd3p) (18) . Aside from the Ume6p related binding motifs, 
no other MEME consensus sequence was present upstream of the 1 1 meiosis 

20 related genes, a finding which suggests that these 11 genes are regulated 

exclusively by UME6 and that Ume6p directly influences their expression. Only 
two other genes possessed the putative binding sequence but did not show 
expression influence on the count of UME6 in our experiments. 

Experimentally driven discovery of network models of expression 

25 control allows for specific biological insights relevant to gene regulatory 

pathways that are not readily reconstructed from the available biological 
literature or present in pre-compiled pathway databases. We have shown here 
that such a system is useful in discovering novel gene function information as 
well as novel regulatory mechanisms. Use of this and similar strategies to 

30 elucidate hierarchical regulatory pathways from full genome expression 
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libraries will allow for rapid insight into transcription regulation that can be 
applied to rational drug discovery and agrochemical targeting. 

Methods 

5 Boolean Network Inference Algorithms 

A gene expression matrix E is created from a set of gene disruption 
experiments. The value of matrix element E(a,b) indicates the expression ratio 
of gene £ Z>' to the normal condition in the disruptant in which gene V which is 
caused by the deletion of gene 'a\ 

10 (1) Using the gene expression matrix E, if the intensity of gene c b' is 

changed higher than a given threshold value <9, or is changed lower than a given 
threshold 1/8 resulting from the disruption of gene c a\ it is defined that gene 
V affects gene c b 9 in directly or indirectly and the value of element (a,b) in the 
binary matrix R is set to 1; R(a,b)^l. Thus the binary matrix R is created by 

15 cutting the value of each element in the gene expression matrix E at the 

threshold (0 or 1/0). 

(2) In the binary matrix R, if there is the relation that gene V and L V affect 
each other, that is R{a,b)^R{b,a)^l !> we cannot decide which gene is located at 
the upper stream. This is the limitation or disadvantage of this method, 

20 however, we introduce an equivalence set, which makes a set of the group 

consisting of genes affecting each other and the group is assumed to be one 
gene. 

(3) Ordering genes (topological sort) Equivalence sets have the semi-order 
relation and we can be drawn up equivalence set in semi-order (topological sort) 

25 to infer the network 

(4) Skeleton matrix; semi-ordered accessibility matrix between equivalence 
sets includes indirect affections. In order to remove them and to make a 
skeleton matrix, we set the rank to each equivalence set defined as follows: 
Equivalence set belonging to rank 1 gives no indirect affection to another 

30 equivalence sets. Equivalence set belonging to rank 3 gives direct affection to 
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the sets with rank 2 and does indirect affection to ones with rank 1. After 
setting the rank to each equivalence set, remove all indirect affection from the 
semi-ordered accessibility matrix. 

(5) Draw multi-level digraph; draw the lines between nodes based on the 
5 value of each element in the skeleton matrix. 

Microarray experiments 

Information on gene expression can be obtained using any conventional 
methods known in the art. For example, microarrays can be used in which 

1 0 complementary DNAs (cDNA) unique to each gene whose expression is to be 

measured are placed on a substrate. RNA obtained from samples can be 
analyzed by hybridization to corresponding cDNA, and can be detected using a 
variety of methods. In some embodiments, it can be desirable to use a 
microarray detectors and/or methods as described in United States Provisional 

15 Patent Application, Serial No: 60/382,669, filed May 23, 2002, herein 

incorporated fully by reference. 

We collected gene expression data using full-genome yeast c-DNA 
microarrays {13) . BY4741 (MATa, HIS3D1, LEU2D0, MET15D0, URA3D0) 
served as the wild type strain. Gene disruptions for strain BY4741 were 

20 purchased from Research Genetics, Inc. Cells were inoculated and grown in 

YPD (1% yeast extract, 1% bacto-peptone, 2% glucose) at 30°C until OD 600 
reached 1.0 in the logarithmic growth phase and harvested to isolate niRNA for 
assay of gene expression. The parental strain was the control used for each 
disruptant strain. 

25 

Data Normalization 

We measured the quantities of 5871 mRNA species from 155 
disruptants by cDNA microarray assay. A difference in fluorescent strength 
between Cy3, Cy5 causes bias of the expression quantity ratio. We normalized 
30 the expression quantity ratios of each expression profile. The ratio bias had a 
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fixed trend in each spotted block, thus we calculated a linear regression to 
normalize the mean value ratio of each block to 1 .0. The logarithm value of the 
ratio was used to indicate the standard expression level, therefore we found the 
logarithm value of ratio and calculated the average and standard deviation of 
these log values (see tablel). The Standard Deviation (SD) of expression levels 
of all spotted genes from the UME6 (YDR207Q disruptant expression array for 
which UME6 is defined as a "Global Regulator" in YPD (33) disruptant was 
0.4931, therefore we recognize that there may be an unacceptable number of 
errors in array data whose overall SD was larger than 0.5. Thus, in certain 
embodiments, one can eliminate such experiments from analysis. It can be 
appreciated that one can select the SD of the data in different ways. By 
selecting a SD of less than 0.5, one can take a relatively conservative approach, 
in that errors of the above type are relatively unlikely to affect the overall gene 
network inference. However, one can appreciate that SD of less than 0.4, less 
than 0.3, less than 0.2, less than 0.1, less than about 0.05, less than about 0.01, 
less than about 0.005, or less than about 0.001 can be selected. 

Selection of Genes 

In YPD, 314 genes were defined as "Transcription Factors", and 98 of 
these have previously been studied for control mechanism. The expression 
profile data of 552 genes including the genes that controlled by this 98 
"Transcription factors" were selected from 5871 profiles. Thus we constructed 
the gene regulatory network from the expression profile data set based on the 
values of these 552 genes in 120 gene disruption experiments. 
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25 II. Estimation of Genetic Networks and Functional Structures Between 

Genes Using Bayesian Networks and Nonparametric Regression 

1. Introduction 

Improvement of the genetic engineering provides us enormous amount 
30 of valuable data such as microarray gene expression data. The analysis of the 

relationship among genes has drawn remarkable attention in the field of 
molecular biology and Bioinformatics. However, due to the cause of 
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dimensionality and complexity of the data, it will be no easy task to find 
structures, which are buried in noise. To extract the effective information from 
biological data, thus, theory and methodology are expected to be developed 
from a statistical point of view. Our purpose is to establish a new method for 
5 understanding the relationships among genes clearer. Example 1 below 

provides the mathematical basis for these methods. 

In certain embodiments of this invention, Bayesian networks are adopted 
for constructing genetic networks using microarray gene expression data from a 
graph-theoretic approach. Friedman and Goldszmidt (1998) proposed an 

1 0 interesting method for constructing genetic links by using Bayesian networks. 

They discretized the expression value of genes and considered to fit the models 
based on multinomial distributions. However, a problem still remains in 
choosing the threshold value for discretizing (by not only the experiments). The 
threshold value assuredly gives essential changes of affects of the results and 

15 unsuitable threshold value leads to incorrect results. On the other hand, 

recently, Friedman et al (2000) pointed out that discretizing could result in 
losses of information. To use the expression data as continuous values, they 
used Gaussian models based on linear regression. However this model can only 
detect linear dependencies and cannot produce a complete picture of 

20 relationships. We developed a new method for constructing genetic networks 

using Bayesian networks. To capture not only linear dependencies but also 
nonlinear structures between genes we use nonparametric regression models 
with Gaussian noise. Nonparametric regression has been developed in order to 
explore the complex nonlinear form of the expected responses without the 

25 knowledge about the functional relationship in advance. Due to the new 

structure of the Bayesian networks, a suitable criterion was needed for 
evaluating models. Thus, this invention includes a new criterion from Bayesian 
statistics. By using these methods we overcame defects of previous methods 
and attained more information. In addition, our methods include previous 
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methods as special cases. We validated our methods through the analysis of the 
S. cerevisiae cell cycle data. 

We found that using Bayesian analysis according to these methods, we 
could identify upstream causative relationships between genes, thereby 
identifying potential therapeutic targets. 

2. Bayesian Networks and Nonparametric Regression 

Using a Bayesian network framework, we consider a gene as a random 
variable and decompose the joint probability into the product of conditional 
probabilities . For example, if we have a series of observations of the random 
vector, we can denote the probability of obtaining a given observation can 
depend upon the conditional probability densities. In certain embodiments, one 
can use nonparametric regression models for capturing the relationships 
between the variables. A variety of graphic tools can be used to elucidate the 
relationships. For example, polynomials, Fourier series, regression spline gases, 
B-spline bases, wavelet bases and the like can be used for defining a graph of 
gene relationships. One difficulty in selecting a proper graph is to properly 
evaluate variance and noise in the system. 

3. Criterion for Choosing a Proper Graph 

In certain embodiments, we can let tz(0 g IX) be the prior distribution on 
the unknown parameter 0 G with hyper parameter vector A and let log tt (0 g \X) = 
0(n). The marginal probability of the data X n is obtained by integrating over 
the parameter space, and we choose a graph G with the largest posterior 
probability. Friedman and Goldszmidt (1998) considered the multinomial 
distribution as the Bayesian network model and also supposed the Dirichlet 
prior on the parameter 6 G . In this case, the Dirichlet prior is the conjugate prior 
and the posterior distribution belongs to the same class of distribution. Then a 
closed form solution of the integration in (4) is obtained, and they called it BDe 
score for choosing graph. A BDe score is confined to the multinomial model, 
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and we propose a criterion for choosing graph in more general and various 
situations. 

A problem of constructing criteria based on the above model is how to 
compute the integration. Some methods that can be used include Markov chain, 
and Monte Carlo methods. In certain embodiments of this invention, we use the 
Laplace approximation for integrals. Thus, the optimal graph is chosen such 
that the criterion BNRC in Example 1 equation (5) is minimized. 

This criterion is derived under log (0 G IX) = 0(h). If log (0 G \X) = 0(1), 
the mode 0 G is equivalent to the maximum likelihood estimate, MLE, and the 
criterion resulted in a Bayesian information criterion, known as BIC by 
removing the higher order terms 

0 (n 3 )(j>0). Konishi (2000) provided a general framework for constructing 
model selection criteria based on the Kullback-Leibler information and Bayes 
approach. As can be seen from Example 1 below, the final graph can be 
selected as a minimizer of the BNRC and does not have to minimize each local 
score, BNRC/, because the graph is constructed as acyclic. 

4. Estimating Graphs and Related Structures Using a BNRC 
Criterion 

For example, one can express our method in an illustration. Essential 
points of our method are the use of the nonparametric regression and the new 
criterion for choosing graph from Bayesian statistics. As for nonparametric 
regression in Section 2 of the this Example 1, we use 5-splines as the basis 
functions. Figure 1 of Example 1 is an example of 5-splines of degree 3 with 
equidistance knots t\ . . . , 1 1 0 . By using a backfitting algorithm, the modes can 
be obtained when the values of j3 jk are given. The backfitting algorithm is 
shown in Example 1 below. 

The modes depend on the hyper parameters /3 jK and we have to choose 
the optimal value of /3 jk . In the context of our method the optimal values of jB jk 
are chosen as the minimizers of BNRQ. 
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The 5-splines coefficient vectors can be estimated by maximizing 
equation (6) of Example 1. The modes of equation (6) are the same as the 
penalized likelihood estimates and we can look upon the hyper parameters X jk or 
ySjk as the smoothing parameters in penalized likelihood. Hence, the hyper 
parameters play an important role for fitting the curve to the data. 

5. Computation Experiments 

We used Monte Carlo simulations to examine the properties of our 
method. Data were generated from an artificial graph and structures between 
variables (Figure 2 of Example 1) and then the results summarized as follows: 
The criterion BNRC can detect linear and nonlinear structure of the data. A 
BNRC score may have a tendency toward over growth of the graph. Therefore, 
we use Akaike's information criterion, known as AIC and use both methods. 
AIC was originally introduced as a criterion for evaluating models estimated by 
maximum likelihood methods. But estimates by this method are the same as the 
maximum penalized likelihood estimate and is not MLE. The trace of S jk shows 
the degree of freedom of the fitted curve and is a great help. That is to say, if 
trSjk is nearly 2, the dependency can be considered linear. We use both BNRC 
and AIC for deciding whether to add a parent variable. 

To validate these methods, we analyzed the S. cerevisiae cell cycle data 
discussed by Spellman et al and Friedman et al. The data were collected from 
800 genes and 77 experiments. We set the prior probability 7i G as constant, 
because we have no information about the size of the true graph. The 
nonparametric regressors are constructed of 20 i?- splines. The number of B- 
splines can be varied as desired. The hyperparameters control the smoothness 
of fitted curve and it can be desirable to select hyper parameters and the number 
of i?-splines to provide a good fit to the data with a minimum of B-splines. 

The results of the analysis can be summarized as follows: Figure 3 of 
Example 1 shows BNRC scores when we predicted CLN2, CDC5 and SVS1 by 
one gene. The genes which give smaller BRNC scores produce a more reliable 



-20- 



WO 03/027262 



PCT/US02/31093 



effect on the target gene. We can observe that which gene is associated with the 
target gene and we find the gene set which strongly depend on the expression of 
the target gene. In face, we can construct a brief network by using these 
information. We can look upon the optimal graph as a revised version of the 
brief network by taking account of the effect of interactions. If there is a linear 
dependency between genes, the score BNRC is good when the parent-child 
relationship is reversed. Therefore, the directions of the causal associations in 
the graph are not strict, especially when the dependencies are almost linear. 
There are some genes that mediate Friedman et at '$ result, such as MCD1, 
CSI2, YOX1 and so on. Most of these genes have been reported to play an 
important role. A large number of the relationships between genes are nearly 
linear. But we could find some nonlinear dependencies which linear models 
can rarely find. 

Figure 5 of Example 1 shows the estimated graph associated with genes 
which were classified by their processes into cell cycle and their neighborhood. 
We omitted some branches in Figure 5, but important information is shown. As 
for the networks given by us and Friedman et al, we confirmed parent-children 
relationships and observed that both two networks are similar to each other. 
Especially, our network includes typical relationships which were reported by 
Friedman et at As for the differences between both networks, we attention the 
parents of SVS1. Friedman et al. employed CLN2 and CDC5 as the parent 
genes of SVS1. On one hand, our result gives CSI2 and YKR090W for SVS1. 
Our candidate parent genes were found to be more appropriate than those of 
Friedman et al. Our model suitably fits to both cases in Figure 4 of Example 1. 
Especially, we conclude that CDC5 has weak effects on SVS1 compared with 
other genes in Spellman's data (see also Figure 3 of Example 1) In fact, as the 
parent gene of SVS1, the order of BNRC score of CDC5 is 247 th . Considering 
the circumstances mentioned above, our method can provide us valuable 
information in understandable and useful form. 
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Figure 6 schematically depicts a method of this invention for 
determining gene network relationships. 

6. Discussion 

The new methods for estimating genetic networks from microarray gene 
expression data by using Bayesian network and nonparametric regression 
provide improved accuracy. We derived a new criterion for choosing graph 
theoretically, and represented its effectiveness through the analysis of the cell 
cycle data. The advantages of our methods include (1) we can use the 
expression data as continuous values; (2) we can also detect nonlinear structures 
and can visualize their functional structures being easily understandable; and (3) 
automatic search can accomplish the creation of an optimal graph. 

Friedman et aPs method retains known parameters such as threshold 
value for discretizing and hyper parameters in the Dirichlet priors selected by 
trial and error and were not optimized in a narrow sense. On the other hand, our 
method can automatically and appropriately estimate any parameter based on 
criterion which has sound theoretical basis. 

In other embodiments we can derive criterion BNRC in more general 
situations. By way of example, we can construct graph selection criterion based 
on other statistical models. 

III. Nonlinear Modeling of Genetic Network by Bayesian Network and 
Nonparametric Regression with Heterogeneous Error Variances and 
Interactions 

In other embodiments, one can use different statistical methods for 
constructing genetic networks from microarray gene expression data based on 
Bayesian networks. In these embodiments, we estimate the conditional 
distribution of each random variable. We consider fitting the nonparametric 
regression, models with heterogeneous error variances and interactions for 
capturing the nonlinear structures between genes. Once we set a graph using 
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Bayesian network and nonparametric regression, a problem still remain to be 
solved in selecting an optimal graph, which gives a best representation of the 
system among genes. A new criterion for choosing graph from Bayes approach 
contains the previous methods for estimating networks using Bayesian methods. 
5 We demonstrated the effectiveness of proposed method through the analysis of 

Saccharomyces cerevisiae gene expression data newly obtained by disrupting 
100 genes. 

1. Introduction 

1 0 Use of Bayesian networks can be effective methods in modeling the 

phenomena through the joint distribution of a large number of variables. 
Friedman and Goldszmidt discretized the expression values and assumed the 
multinomial distributions as candidate statistical models. Peter et al. 
investigated the threshold value for discretizing. Friedman et al. pointed out 

15 that discretizing can lose some information considered to fit the linear 

regression models, which analyze the data as continuous. However, the 
assumption that the parent genes depend linearly on the objective gene is not 
always warranted. Imoto et al described the use of nonparametric additive 
regression models for capturing not only linear dependencies but also nonlinear 

20 relationships between genes. 

In certain embodiments, we use Bayesian networks and the 
nonparametric heteroscedastic regression which can be more resistant to effect 
of outliers and can also capture effects of the interactions of parent genes. 

After setting the graph, we evaluate its goodness or closeness to the true 

25 graph, which is completely unknown. The construction of a suitable criterion 

becomes important. Friedman and Goldszmidt derived the criterion, BDe, for 
choosing a graph based on multinomial models and Dirichlet priors. However, 
unknown hyper parameters in Dirichlet priors remain and one can only set the 
values experientially. We derived a new criterion for choosing a graph from 

30 Bayes approach. The criterion automatically optimizes the all parameters 
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model and gives the optimal graph. In addition, our method includes the 
previous methods for constructing genetic network by Bayesian network. To 
show the effectiveness of the new method we analyze gene expression data of 
Saccharomyces cerevisiae by disrupting 100 genes. 

2. Bayesian Network and Nonparametric Heteroscedastic 
Regression Model With Interactions 

Suppose that we have n sets of data {x u ...,x n }of the ^-dimensional 

random variable vector X = (Xi . . . , X p ) r where x t = x/i, . . . x iv ) T and x T 
denotes the transpose of x. In microarray gene expression data, n and p 
correspond to the numbers of arrays and genes. Under Bayesian network 
framework, we considered a directed acyclic graph G and Markov assumption 
between nodes. The joint density function is then decomposed into the 
conditional density. Assuming that the conditional densities are parameterized 
by the parameter vectors & } and the effective information is extracted from these 
probabilistic models. 

We then used nonparametric regression strategy for capturing the 
nonlinear relationships between Xy and p,y. In many cases, this method can 
capture the objective relationships very well. When the data, however, contain 
outliers especially near the boundary on the domain {p,,}, nonparametric 
regression models sometimes can lead to unsuitable smoothed estimates, i.e., 
the estimated curve exhibits some spurious waviness due to the effects of 
outliers. To avoid this problem, the nonparametric regression model with 
heterogeneous error variances as described below in Example2. 

Criterion For Choosing a Graph 

Once we set a graph, a statistical model (equation 8 of Example 2) based 
on the Bayesian network and nonparametric regression can be constructed and 
can be estimated by a suitable method. However, to choose the optimal graph, 
which gives a best approximation of the system underlying the data, it is 
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undesirable to use the likelihood function as a model selection criterion, because 
the value of likelihood becomes larger in a more complicated model. Hence, a 
statistical approach based on the generalized or predictive error, Kullback- 
Leibler information, Bayes approach and so on (20) can be desirable. We 
5 therefore constructed a criterion for evaluating a graph based on our model 

(equation 8 of Example 2) from Bayes approach. 

A criterion for evaluating the goodness of the graph can be constructed 
from Bayesian theoretic approach as follows: 

(1) Obtain the posterior probability of the graph by the 
1 0 product of the prior probability of the graph n G and the 

marginal probability of the data; 

(2) Remove the standardized constant, the posterior 
probability of the graph is proportional to equation 9 of 
Example 2. 

1 5 (3) Using Bayes approach, choose a graph such that tt(G I X n ) 

is maximum. 

(4) Determine if the computation involves a high 
dimensional integral (equation 9 of Example 2). 

(5) If Yes to (4), use equation 11 of Example 2, [see 
20 references 17 and 26 of Example 2 for integrals. 

Thus, a graph is chosen such that the criterion BNRC is minimal. An 
advantage of using the Laplace method is that it is not necessary to consider the 
use of the conjugate prior distribution. Hence modeling in larger classes of 
25 distributions of the model and prior is attained. 

3. Estimating Genetic Network 

Nonparametric Regression 

We present methods for constructing a genetic network based on the 
30 method described in Section 2 above. The nonparametric regressor (equation 5 
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of Example 2) has the two components: the main effects component represented 
by the additive model of each parent gene, and the interactions component. In 
the additive model, we construct each smooth function m Jk Q by 5-splines 
(equation 9 of Example 2) (see referenced). 

In the interaction terms, we use a Gaussian radial basis function. In the 
context of the regression modeling based on the radial basis functions, we used 
two methods for estimating the centers z n and the widths. This method can be 
termed "fully supervised learning." An alternative method determines the 
values by using only the parent observation data in advance. The latter method 
can be employed, and a k-means clustering algorithm for constructing the basis 
functions can be used. Details of the radial basis functions are described further 
in references 7, 21 and 23 of Example 2. Hyper parameters control the amount 
of overlapping among basis functions. 

In error variances, we consider a heterocedastic regression model and 
assume the structure shown in equation 6of Example 2. The design of the 
constants can affect the accuracy of capturing the heteroscedasticty of the data. 
We set the weights according to Equation 12 of Example 2. If the hyper 
parameter p is set to 0, then the variances are homogeneous. If we use a large 
value of p, then the error variances of the data, which exist near the boundary on 
the domain of the parent variables, are large. Hence, if there are outliers near 
the boundary, this method can reduce their effects and thereby gain suitably 
smoothed estimates using appropriate values of p. 

4. Real Data Analysis 

We demonstrated the effectiveness of the above methods through 
analysis of S. cerevisiae gene expression data. We observed the changes of 
expression level of genes on a microarray by gene disruption. By using this 
method we revealed gene regulatory networks between genes of Saccharomyces 
cerevisiae. We collected a large number of expression profiles from gene 
disruption experiments to evaluate genetic regulatory networks. Over 400 
mutants are stocked and gene expression profiles are accumulated. 
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We monitored the transcriptional level of 5871 genes spotted on a 
microarray by a scanner. The expression profiles of over 400 disruptants were 
piled up in our database. The standard deviation (SD) of all genes on a 
microarray was evaluated and the value of SD represented roughly the error of a 
5 experiment. We needed the value of 0.5 as the critical point of the standard 

deviation of the expression ratio of all genes. 107 disruptants including 68 
mutants where the transcription factors were disrupted were selected from 400 
profiles. 

We used 100 microarrays and constructed the genetic network of 521 

1 0 genes from the above data. 94 transcription factors whose regulating genes are 

identified were found, and the profiles of 421 genes controlled by 94 factors 
were selected from 5871 profiles. We constructed the nonparametric regression 
model using 20 5-splines and 20 radial basis functions. We confirmed that the 
differences of the smoothed estimates against the various number of the basis 

1 5 functions can not visually be found, because when we use the somewhat large 

number of the basis functions, the hyper parameters control the smoothness of 
the fitted curves. We applied the two-genes effect to the interaction 
components. Hence, the effects of the interactions were obtained as the fitted 
surfaces and can be visually understandable. 

20 We showed the roles of the hyper parameters in the prior distributions 

and weight constants. Figure 1(a) of Example 2 shows the scatter plot of 
YGL237C and YEL071W with smoothed estimates by 3 difference values of 
the hyper parameters. The smoothed estimate depends on the values of the 
hyper parameters. Figures 1(b) of Example 2 depicts the behavior of the BNRC 

25 criterion of the two genes in Figure 1 (a). We can choose the optimal value of 

the hyper parameter as the minimizers of the BNRC, and the optimal smoothed 
estimates (solid curve) can capture structure between these genes well. The 
dashed and dotted curves are near the maximum likelihood estimate and the 
parametric linear fit, respectively. The effect of the weight constants wlj, . . . , 

30 wnj is shown in Figure 2(a) of Example 2. If we use a homoscedastic regression 
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model, we obtain the dashed curve, which exhibits some spurious waviness due 
to the effect of the data in the upper-left corner. By adjusting the hyper 
parameter p in equation 12 of Example 2, the estimated curve resulted in the 
solid curve. The optimal value of p chosen by minimizing the BNRC criterion 
5 (see Figure 2(b) of Example 2). When the smoothed estimate is properly 

obtained, the optimal value of p tends to zero. 

We employed a two-step strategy for fitting the non-parametric 
regression model with interactions. First, we estimated the main effects 
represented by the additive Z?-spline regressors. Next, we fit the interactions 

10 components to the residuals. Figure 3 of Example 2 depicts an example of the 

fitted surface to the relationship between YIL094C and its parent genes, 
YKL152C and YER055C. The interaction of the two parent genes leads to over 
expression when both parent genes increased. 

In Saccharomyces cerevisiae, the GCN4 gene encodes the 

15 transcriptional activator of the "general control" system of amino acid 

biosynthesis, a network of at least 12 different biosynthetic pathways [reference 
6 of Example 2]. Experiments showed that the consequences of the general 
control response upon the signal "amino acid starvation" induced by the 
histidine analogue 3-aminotriazole with respect to GCN4p levels. GCN4 

20 activates transcription of more than 30 genes involved in the biosynthesis of 1 1 

amino acids in response to amino acid starvation or impaired activity of tRNA 
synthetases (see reference 24 of Example 2). Purine biosynthetic genes ADE1, 
ADE4, ADE5,7 and ADE8 display GCN4-dependent expression in response to 
amino acid starvation [24]. GCN4 activates transcription of biosynthetic genes 

25 for amino acids and purines in response to either amino acid or purine starvation 

[24]. Those results of experiments show that there are strong relationships 
between purine metabolism and amino acid metabolism through GCN4. Our 
map of relationships fit well known relationships between purine and amino 
acid metabolism. 

30 
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IV. Bootstrap Nonlinear Modeling of Genetic Network by Using 
Bayesian Network and Nonparametric Heteroscedastic Regression 

In other embodiments, we modified the above approaches. A relevant 
feature of Bayesian network construction is in the estimation of conditional 
5 distribution of each random variable. We fit non-parametric regression models 

with heterogeneous error variances to the microarray gene expression data to 
capture nonlinear structures between genes. A problem still remained to be 
solved in selecting an optimal graph, which gives the best representation of the 
system among genes. We derived a new graph selection criterion from Bayes 

1 0 approach in general situation. The proposed method includes previous methods 

based on Bayesian networks. We also used a method for measuring the edge 
intensity and the degree of confidence of the direction of Bayes causality in an 
estimated genetic network. We demonstrated the effectiveness of the proposed 
method through the analysis of Saccharomyces cerevisiae gene expression data 

1 5 newly obtained by disrupting 100 genes. 

Once we set the graph, it is desirable to evaluate its goodness or 
closeness to the true graph, which is unknown. For our method we needed to 
establish the method for measuring the edge intensity. We used a "bootstrap" 
method (Efron, 1979; Efron and Tibshirani, 1993) to solve this problem. By 

20 using this method, one can measure not only the intensity of edge, but also the 

degree of confidence of Bayes causality. To show the effectiveness of the 
proposed method, we analyzed gene expression data of Saccharomyces 
cerevisiae newly obtained by disrupting 100 genes. 

25 1. Bayesian Network and Nonparametric Heteroscedastic 

Regression for Non-Linear Modeling of a Genetic Network 
1.1 Nonlinear Baysian Network Model 

As described elsewhere in this application, in the Bayesian network 
framework, one can consider a directed acyclic graph G and Markov 
30 assumption between nodes. The joint density function is then decomposed into 
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the conditional density of each variable. Through formula 1 of Example 3 
below, the focus of interest in statistical modeling by Bayesian networks is how 
one can construct the conditional densities, f}. One can assume that the 
conditional densities, Jj , are parameterized by the parameter vectors, and 
5 information is extracted from these probabilistic models as described further in 

Example 3. 

In certain embodiments, one can use nonparametric regression strategies 
for capturing nonlinear relationships between xij and Pij and suggested that 
there are many nonlinear relationships between genes and the linear model 

10 hardly achieves a sufficient result. In many cases, these methods can capture 

the objective relationships well. When the data, however, contain outliers, 
especially near the boundary of the domain, standard nonparametric regression 
models sometimes lead to unsuitable smoothed estimates, i.e., the estimated 
curve exhibits some spurious waviness due to the effects of the outliers. To 

1 5 avoid this problem, we fit a nonparametric regression model with heterogeneous 

error variances. If the number of the parameters in the model is much larger 
than the number of observations, it has a tendency toward unstable parameter 
estimates. 

In certain cases it can be desirable to use nonparametric regression 
20 instead of linear regression, because linear regression cannot evaluate the 

direction of the Bayes causality or can lead to the wrong direction in many 
cases. We show an advantage of the new model compared with linear 
regression through a simple example. Suppose that we have data of genel and 
gene 2 in Figure 1(a) of Example 3. We consider the two models genel>gene2 
25 and gene2>genel, and obtain the smoothed estimates shown in Figures (b) and 

(c), respectively of Example 3. We then can decide that the model (b: genel 
~>gene2) is better than (c: gene2 -> genel) by the our criterion, which is 
derived in a later section (the scores of the models are (b) 120.6 (c) 134.8). 
Since we generated this data from the true graph gene-^gene2, our method 
30 yields the correct result. However, if we fit the linear regression model to this 
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data, the mode (c) is chosen (the scores of the linear models are (b) 156.0 (c) 
135.8). The method based on linear regression yields an incorrect result in this 
case, whereas non-linear regression analysis yields a correct result. 

Even in the case where the relationship between genes is almost linear. 
Our method and linear regression can fit the data appropriately. However, using 
the linear model it is difficult to decide the direction of Bayes causality. 

We have previously described the criterion BNRC. We derived the 
criterion, BNRC toero , for choosing the graph in a general framework. By using 
the equation (8), the BNRC tem? score of the graph can be obtained by the sum 
of the local scores, BNRQ ctero . The optimal graph is chosen such that the 
criterion BNRC toero (equation 7 of Example 3) is minimal. An advantage of the 
Laplace method is that it is not necessary to consider the use of the conjugate 
prior distribution. Hence the modeling in the larger classes of distributions of 
the model and prior is attained. Genetic networks are estimated as described 
further in Example 3. However, regarding error variances, we consider a 
heteroscedastic regression model and assume the structure shown in equation 3 
of Example 3. 

Learning Network 

In Bayesian network literature, determining the optimal network is an 
NP-hard problem. To resolve the networks, one can use a "greedy hill- 
climbing" algorithm as follows: 

(1) Step 1 : Make the score matrix whose (i, j) -th elements is the 
BNRC he t ero score of the graph gene/ -> gene/; 

(2) Step 2: For each gene, implement one of three procedures for an 
edge: add, remove, reverse, which gives the smallest BNRQ e , ero ; 

(3) Step 3: Repeat Step 2 until the BNRC toero does not reduce 
further; and 

(4) Permute the computational order of genes and make many 
candidate learning orders in Step 3. 
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Step 4 can be desirable in situations in which a greedy hill-climbing 
algorithm produces many local minima and the result depends on the 
computational order of variables. Another problem of the learning network is 
that the search space of the parent genes is wide when the number of genes is 
large. Is such circumstances, one can restrict the set of candidate parent genes 
based on the score matrix, which is given by Stepl . 

One also can use this learning strategy for learning genetic network and 
showed the effectiveness of their method by the Monte Carlo simulation 
method. We also checked the efficiencies of our new model through the same 
Monte Carlo simulations and found improvements due to the nonparametric 
heteroscedastic regression model, We illustrate the effectiveness of the 
heteroscedastic regression model in the next subsection. 

Hyper parameters 

Consider the nonparametric regression model defined in equation 4 of 
Examples. The estimate is a mode of log % (0j \Xn) and can depend on the 
hyper parameters used. We constructed the nonparametric regression model 
using 20 ^-splines. We confirmed that the differences of the smoothed 
estimates against the various number of the basis functions cannot be visually 
detected. When we used a somewhat large number of basis functions, the hyper 
parameters control the smoothness of the fitted curves. (Figure 3(a) of Example 
3 shows the scatter plot of YGL237C and YEL071W with smoothed estimates 
for 3 different values of the hyper parameters. The details of the data are shown 
in later section. The smoothed estimate strongly depended on the values of the 
parameters. Figures 3(b) of Example 3 depicts the behavior of the BNRCWro 
criterion of the two genes in Figure 3(a). One can choose the optimal value of 
the hyper parameter as the minimizer of the BNRC hete ro and the optimal 
smoothed estimate (solid curve in Figure 3(a) can capture the structure between 
these genes well. The dashed and dotted curves are near the maximum 
likelihood estimate and the parametric linear fit, respectively. 
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Effects of the weight constants wij, . . . wnj is shown in Figure 4(a) of 
Example 3. If one vises the momoscedastic regression model, we obtain the 
dashed curve, which exhibits some spurious waviness due to the effect of data 
in the upper-left corner. By adjusting the hyper parameter/?/ in equation 9 of 
5 Example 3, the estimated curve resulted in the solid curve. The optimal value 

of pj can be also chosen by minimizing the BNRCw* criterion see Figure 4(b) 
of Example 3. When the smoothed estimate is properly obtained, the optimal 
value of pj tends to zero. 

Finally, in Section 3 of Example 3, we provide an algorithm for 
1 0 estimating the smoothed curve and optimizing the hyper parameters. 

Step 1 : Fix the hyper parameter pfi 
Step 2: Initialize: y jk = o, k - 1, . . qj\ 
Step 3: Find the optimal $ Jk by repeating Steps 3-1 and 3-2; 

Step 3-1 : Compute: yjk = (B T Jk W jk B jk + nj3 Jk K Jk ) ~ 7 B T Jk W jk x (x® - 2 

for fixed 

Step 3-2: Evaluate: Repeat Step 3-1 against the candidate value of $ Jk 
and choose the optimal value of (3,*, which minimizes the BNRCWro • 
20 Step 4: Convergence: repeat Step 3 for k « 1, . . . , qj, 1, . - - , qj\ 1, . • - until a 

suitable convergence criterion is satisfied. 

Step 5: Repeat Step 1 to Step 4 against the candidate value of pj, and choose the 
optimal value of pj\ which minimizes the BNRC hetero- 

25 Bootstrap Edge Intensity and Degree of Confidence of Bayes 

Causality 

We measured the intensity of the edge and the degree of confidence of 
the direction of the Bayes causality in the estimated genetic network by the 
bootstrap method The algorithm can be expressed as follows: 
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(1) Make a bootstrap gene expression matrix X\ = {x* ly . . . x* n } T . 
by randomly sampling n times, with replacement, from the original gene 
expression data 

{xi, . . . , x n }; 

(2) : Estimate the genetic network from X* n based on the proposed 
method; and 

(3) : Repeat Stepl and Step 2 r times. 

From this algorithm, we obtain T genetic networks. We define the bootstrap 
intensity of edge and direction of Bayes causality as follows: 

Edge intensity 

If the edges genet genej and genej -> genet exist tj and t 2 times in the 
T networks, respectively, we define the bootstrap edge intensity between gene/ 
and genej as (tj + t 2 )/T. 

Degree of Confidence of the Bayes Causality 

lit i > t 2 , we adopt the direction genet -> genej and define that the degree 
of confidence of causality is t/fa + t 2 ). However, we can rise a certain 
threshold. For example, we set a real number 8 and decide genej gene 2 if 
t/tj + t 2 ) > S. 

At least two ways can be used to show the resulting network. First, it 
can be desirable to determine the edge intensity and the degree of confidence of 
direction of Bayes causality in the original genetic network. One can add the 
intensity to each edge in the original network. From this network, one can find 
the reliability of the original network. Second, one can superpose the bootstrap 
networks and original network. However, the superposed network contains 
edges that have small intensities. Therefore, we can set a certain threshold 
value and remove the edges whose intensities are under the threshold. While 
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setting the threshold remains a problem, it is just a visualization problem. We 
note that the superposed network may not hold the acyclic assumption, but 
much effective information are in this network. 

Real Data Analysis 

5 We illustrate the effectiveness of our methods through the analysis of 

Saccharomyces cerevisiae gene expression data. We monitored the 
transcriptional level of 5871 genes spotted on a microarray by a scanner. The 
expression profiles of over 400 dirsuptants were stored in our database. The 
standard deviation (SD) of the levels of all genes on a microarray was 

1 0 evaluated. The value of SD represents roughly the experimental error. In our 

data, we considered the value of 0.5 as the critical point of the accuracy of 
experiments. We evaluated the accuracy of those profiles based on the standard 
deviation of the expression ratio of all genes. 107 disruptants including 68 
mutants where the transcription factors were disrupted could be selected from 

15 400 profiles. 

It can be appreciated that other values of SD can be considered critical 
for accuracy. For example, SD values can be 0.4, 0.3, 0.2, 0.1, about 0.05, 
about 0.01, about 0.005, about 0.001, or any other value considered to produce 
information having a desired reliability. 

20 We used 100 microarrays and constructed a genetic network of 521 

genes from the above data. The 94 transcription factors whose regulating genes 
have been clearly identified were found. The profiles of the 521 genes 
controlled by those 94 factors were selected from 5871 profiles. 

Table 1 of Example 3 shows the gene pairs with high bootstrap edge 
25 intensities. "Inte. " and "Dire, " mean the bootstrap edge intensity and the 

degree of confidence of direction of B ayes causality, respectively. "F. M is the 
function of parent gene, i.e. and are respectively, induce and repress. If 
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we cannot decide whether the function is to induce or repress, we enter "0" in 
"F.". Figure 5 of Example 3 shows the resulting partial network by 
superposing 100 bootstrap networks. We denote the edge intensity by the line 
width, and the number next to the line is the degree of confidence of the 
5 direction of Bayes causality. 

From Table 1, we conclude the following: Over 60% gene pairs in the 
Table 1 agree with the biological knowledge. Most of the genetic sets, whose 
value of edge intensity equals 1, have the relation of "Related protein" in the 
YPD data base. Some other genetic sets, like ARO genes and PDR genes, have 

10 the same regulatory systems. These results indicated that there were some 

relations between genes which were previously unknown and revealed for the 
first time using the methods of this invention. We also notice that the other 9 
genetic sets have the relation "Functional genomics". The relation of 
"Functional genomics" indicated that some relations were revealed by previous 

15 information derived from large-scale, high-throughput experiments such as 

microarray analysis, designed to uncover properties of large groups of proteins. 

Studies on the regulation of the purine biosynthesis pathway in 
Saccharomyces cerevisiae revealed that all the genes encoding enzyme required 
for AMP de novo biosynthesis are repressed at the transcriptional level by the 

20 presence of extracellular purines. Two transcriptional factors, named Baslp and 

Bas2p, are required for regulated activation of the ADE genes (Daignan-Fornier 
and Fink, 1992) as well as some histidine biosynthesis genes (Denis et al., 
1998). Those purine biosynthetic genes, like ADE1, ADE4, ADE5,7 and 
ADE8, display GCN4-dependent expression in response to amino acid 

25 starvation (Rolfes and Hinnebusch, 1993). The starvation for purines stimulates 

GCN4 translation by the same mechanism that operates in amino acid-starved 
cells leads to a substantial increase in HIS4 expression, one of the targets of 
GCN4 transcriptional activation. Figure 5 indicates that those ADE genes and 
histidine biosynthesis genes are related with both BAS1 and GCN4! 
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A desirable feature of the methods of this invention involve the use of 
non-parametric heteroscedastic regression for capturing nonlinear relationships 
between genes and heteroscedasticity of the expression data. Therefore, our 
methods can be useful in the analysis of an unknown system, such as human 
5 genome, genomes from other eukaryotic organisms, prokaryotic organisms and 

viruses. 

V. Statistical Analysis of a Small Set of Time-Ordered Gene Expression 
Data Using Linear Splines 

Recently, the temporal responses of genes to changes in their 

10 environment has been investigated using cDNA microarray technology by 

measuring the gene expression levels at a small number of time points. 
Conventional techniques for time series analysis are not suitable for such a short 
series of time-ordered data. The analysis of gene expression data has therefore 
usually been limited to a fold-change analysis, instead of a systematic statistical 

1 5 approach. 

We use the maximum likelihood method together with Akaike's 
Information Criterion to fit linear splines to a small set of time-ordered gene 
expression data in order to infer statistically meaningful information from the 
measurements. The significance of measured gene expression data is assessed 

20 using Student's Mest. 

Previous gene expression measurements of the cyanobacterium 
Synechocystis sp. PCC6803 were reanalyzed using linear splines. The temporal 
response was identified of many genes that had been missed by a fold-change 
analysis. Based on our statistical analysis, we found that about four gene 

25 expression measurements or more are needed at each time point. These 

conclusions and further description can be found in Example 4 herein below. 
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1. Introduction 

In recent years, many cDNA microarray experiments have been 
performed measuring gene expression levels under different conditions. 
Measured gene expression data have become widely available in publicly 
5 accessible databases, such as the KEGG database (Nakao et aL, 1999). 

In some of these experiments, the steady-state gene express levels are 
measured under several environmental conditions. For instance, the expression 
levels of the cyanobacterium Synechocystis sp. PCC6803 and a mutant have 
been measured at different temperatures, leading to the identification of the 
1 0 gene Hik33 as a potential cold sensor in this cyanobacterium (Suzuki et aL, 

2001). 

In other experiments, the temporal pattern of gene expression is 
considered by measuring gene expression levels at a number of points in time. 
Gene expression levels that vary periodically have for instance been measured 

15 during the cell cycle of the yeast Saccharomyces cerevisiae (Spellman et aL 

1998). The gene expression levels of the same yeast species were measured 
during the metabolic shift from fermentation to respiration (DeRisi et al 1997). 
In those experiments, environmental conditions were changing slowly over 
time. Conversely, gene responses to an abruptly changing environment can be 

20 measured. As an example, the gene expression levels of the cyanobacterium 

Synechocystis sp. PCC 6803 were measured at several points in time after a 
sudden shift from low light to high light (Hihara, 2001). 

In cDNA microarray experiments, gene expression levels are typically 
measured at a small number of time points. Conventional techniques of time 

25 series analysis, such as Fourier analysis or autoregressive or moving-average 

modeling, are not suitable for such a small number of data points. Instead, the 
gene expression data are often analyzed by clustering techniques or by 
considering the relative change in the gene expression level only. Such a "fold- 
change" analysis may miss significant changes in gene expression levels, while 

30 it may inadvertently attribute significance to measurements dominated by noise. 
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In addition, a fold-change analysis may not be able to identify important 
features in the temporal gene expression response. 

Several techniques to analyze gene expression data, such as deriving 
Boolean or Bayesian networks, have been used in the past (Liang et al., 1998 
5 Akutsu et al, 2000; Friedman et al; 2000). Describing gene interactions in terms 

of a regulatory network is can be desirable, and developing a network model 
may benefit from gene expression data obtained for a large number of time 
points. However, data for large numbers of time points for a large number of 
genes is currently not available. Although the number of genes in a given 

1 0 organism may be on the order of several thousands, gene expression levels are 

often measured at only five or ten points in time. 

So far, a systematic method has been lacking to statistically analyze 
gene expression measurements from a small number of time-ordered data. We 
developed a strategy based on fitting linear spline functions to time-ordered data 

15 using the maximum likelihood method and Akaike's Information Criterion 

(Akaike, 1971). The significance of the gene expression measurements is 
assessed by applying Student's Mest. This allows us to infer information from 
gene expression measurements while taking the statistical significance of the 
data into consideration. This kind of analysis can be viewed as an additional 

20 step toward building gene regulatory networks. As an example, we reanalyzed 

the gene expression measurements of the cyanobacterium Synechocystis sp. 
PCC 6803 (Hihara, 2001). Using linear splines, information can be deduced and 
measured data that is missed by methods using the fold-change only. By 
repeating our analysis with a subset of the available data, we determined how 

25 many measurements are needed at each time point in order to reliably estimate 

the linear spline function. 

2. Methods 

Student's Mest 
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We assessed whether the measured gene expression rations are 
significantly different from unity. If for a specific gene, we can conclude that 
for all time points the measured expression rations are not significantly different 
from unity, we can eliminate that gene from further analysis. The significance 
5 level can be established by applying Student's Mest for each time point 

separately. Since multiple comparisons are being made for each gene, the value 
of the significance level a should be chosen carefully. 

We define Ho (l) as the hypothesis that for a given gene the expression 
ratio is equal to unity at a given time point t it and Ho as the hypothesis that for a 

1 0 given gene the expression rations at all time points are equal to unity. If we 

denote a as the significance level for rejection of hypothesis Ho, and a y as the 
significance level for rejection of hypothesis H 0 (l) , then a 5 and a are related via 
1-a = (1 - a') a , in which a is the number of time points at which the gene 
expression ratio was measured. Note that by expanding the right hand side in a 

15 first order Taylor series, this equation reduces to Bonferoni's method for 

adjusting the significance levels (see also Anderson and Finn, 1996). 

By performing Student's t-test at every time point for each gene using a' 
as the significance level, we will find whether Ho (l) and therefore Ho should be 
rejected. If Ho is not rejected, we can conclude that that gene is not 

20 significantly affected by the experimental manipulation, and should therefore 

not be included in further analyses. If for a given gene the null hypothesis Ho is 
rejected, we conclude that that gene was significantly affected by the 
experimental manipulation. 

25 Analyzing time-ordered data using linear splines 

Next, we analyzed temporal gene expression response for genes that 
were found to be significantly affected. The measured gene expression ratios 
formed a small set of time-ordered data, to which we fit a linear spline function. 
A linear spline function is a continuous function consisting of piecewise linear 
30 functions, which are connected to each other at knots (Friedman and Silverman, 
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1989; Higuchi, 1999). Whereas cubic splines are used more commonly, for the 
small number of data pints we can use linear spline functions more suitably. A 
conceptual example of a linear spline function with knots t*i, t* 2 , t\ t* 4 , is 
shown in Figure 1 of Example 4. We wish to fit a nonparametric regression 
5 model of the form xj — g(tj) + £j to these data, in which g is a linear spline 

function and Sj is an independent random variable with a normal distribution 
with zero mean and variance o 2 . 

We estimated the linear spline function g using the maximum likelihood 
method. The probability distribution of one data point xj given t j, is shown in 

1 0 equation 3 of Example 4. The log-likelihood function for the n data points is 

then given by equation 4 of Example 4. The maximum likelihood estimate of 
the variance o 2 can be found by maximizing the log-likelihood function with 
respect to o 2 . This yields equation 5 of Example 4. 

The fitted model depends on the number of knots q. The number of 

15 knots can be chosen using Akaike's Information Criterion, known as the AIC 

(Akaike, 1971; see also Priestly, 1994). For each value of q, we calculated the 
value of the AIC after fitting the linear spline function as described in Example 
4, and selected the value of q that yields the minimum value of the AIC The 
case q^l corresponds to linear regression. For the special case q=\, we 

20 effectively fit a flat line to the data. If we find that for a particular gene, the 

minimum AIC is achieved for the constant function (q = 1), then we can 
conclude that the expression level of that gene was unaffected by the 
experimental manipulations. Gene expression data are typically given in terms 
of expression rations. At time zero, the expression ration is equal to unity by 

25 definition. This fixed point can be incorporated easily in our methodology by 

modifying equation 7 of Example 4. The minimum value of D 2 can be achieved 
by choosing the linear spline function shown in equation 17 of Example 4. 

3. Results 
30 Student's r-test 
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We herein illustrate using Student's Mest and fitting linear spline 
functions, by reanalyzing measured gene expression profile of the 
cyanobacterium sp. PCC 6803 after a sudden exposure to high light (HL) 
(Hihara et al 9 2001). The expression levels of 3079 ORFs were measured at 
5 zero, fifteen minutes, hone hour, six hours, and fifteen hours both for 

cyanobacteria exposed to HL and cyanobacteria that remained in the low light 
(LL) condition. Table 1 of Example 4 shows the number of measurements at 
each time point. Data from the cDNA expression measurements were obtained 
from the KEGG database (Nakao et al., 1999). 

10 The data used for the original analysis (Hihara, 2001) may not be 

identical to the raw data submitted to KEGG (Hihara, personal communication). 
In addition, two measurement sets out of six at the t = 15 minutes time point are 
missing in the KEGG database. Recalculating the gene expression rations from 
the raw data gives numbers close to the previously published results. 

1 5 After subtracting the background signal intensities from the HL and LL 

raw data, global normalization was applied and the ration of the HL to the LL 
signal intensities was calculated to find the relative change in gene expression 
level with respect to the control (LL) condition. In the fold-change analysis, a 
gene was regarded as being affected by HL if its expression level changed by a 

20 factor of two or more. The statistical significance of such changes was assessed 

heuristically by considering the size of the standard deviation of the 
measurements (Hihara, 2001). 

The results of the Student's £-test on the gene expression rations of each 
gene separately are shown in Table 2. At a significance level of a = 0.001, 167 

25 genes were found to be significantly affected by HL condition. Note that we 

would expect about three type-1 errors among these 167 genes. In comparison, 
164 ORFs were found to be affected by the HL condition in the original analysis 
(Hihara, 2001). 

By considering the fold-change for the psbD2 gene (slr0927), it was 
30 concluded that it was not significantly induced by HL (Hihara, 2001). This was 
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remarkable, since this gene had been reported to be inducible by HL in the 
cyanobacterium Synechococcus sp. PCC 7942 (Bustos and Golden, 1992; 
Anandan and Golden, 1997). However, performing Student's Mest on the gene 
expression data for the psbD2 gene at t = 6 hours yields p = 3.3 x 10" 5 , 
5 suggesting that this gene was indeed affected by HL. 



Analysis Using Linear Spline Functions 

Next we fit linear spline functions to the measured gene expression 
ratios. The number of knots q were between one and five, with a fixed knot 

1 0 equal to unity at time zero. For q = 3 and q = 4, three possibilities exist for the 

placement of the knots between the linear segments of the linear spline. These 
are indicated in Figure 2 of Example 4, together with the cases q = \,q^2, and 
q = 5. Notice that the number of possible knot placements increases 
exponentially with the maximum number of knots q max . 

15 In the fold-change analysis, temporal gene expression patterns were 

classified into six categories (Hihara, 2001), listed in Table 3 of Example 4. 
Fitting a linear spline function to the measured gene expression data provided a 
more flexible way to describe the data than categorizing. In addition, a 
numerical description of the gene expression response pattern is an important 

20 step in deriving a gene regulatory network. 

Analyzing Expression Data Using Linear Splines 

We next illustrate the usage of the AIC with an example. In the fold- 
change analysis, the threonine synthase gene thrC {slll688) was found to be 
25 repressed at the approximately one hour. The calculated values of the AIC for 

the different sets of knots are listed in Table 4. The minimum AIC was 
achieved for knots at 0, 15 minutes, 1 hour, and 15 hours. Figure 3 of Example 
4 shows the measured gene expression levels, together with the linear spline 
that was fitted to the data. 
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Performing this procedure for all the gene expression measures let us 
classify different genes based on their time-dependent response to HL. Several 
choices can be made as to which genes to include in this analysis. In the 
original analysis, genes were removed from the calculation if their expression 
5 levels were among the 2000 lowest out of 3079 ORFs (Hihara, 2001). 

Alternatively, we can eliminate genes if Student's /-test showed that they were 
not significantly affected by HL. Table 5 shows the number of genes whose 
measured expression levels correspond to each pattern for these different cases. 
For Student's /-test, a significance level q = 0.001 was used. 

1 0 We compared the 167 genes which were identified by Student's /-test as 

significantly affected by HL with the results from the fold-change analysis 
(Hihara, 200 1), in which 164 ORFs were identified. First, we removed those 
genes from our analysis for which an outlier was present in the data. We define 
an outlier as a data point that deviates more than two standard deviations from 

1 5 the mean of the data at a given time point. Only one gene was found for which 

the measured expression data contained an outlier; the linear spline function 
fitted to its expression data was a flat line. None of the other gene expression 
levels was described by a flat line, which is consistent with the results from 
Student's /-test. 

20 Next, we removed those genes whose expression level was among the 

lowest 2000 in order to avoid using data that are dominated by noise. The same 
procedure had been used for the fold-change analysis (Hihara, 2001). After 
removal of these genes, 107 genes remained that were significantly affected by 
HL. 

25 Of the 107 genes, 42 had not been identified in the fold-change analysis 

(Hihara, 2001). These genes are listed in Table 6 of Example 4, together with 
the location of the knots that we found for each gene. For each linear spline 
function the percentage variance explained was calculated as a measure of the 
goodness of fit. As an example, Figure 4 shows the measured gene expression 

30 ration as well as the fitted linear spline function of the gene sylR (slr0329\ 
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having four knots at zero, fifteen minutes, one hour, and fifteen hours. Of the 
42 genes, the gene xylR (slr0329) had the largest percentage variance explained 
(98.7%). Of the 164 ORFs that were identified in the fold-change analysis, 39 
were not significantly affected by HL according to Student's Mest at a 
5 significance level q = 0.001 . These ORFs are listed in Table 7 of Example 4. 

Finally, we established whether the number of measurements at each 
time point was sufficient to reliably determine the placements of knots for the 
linear spline function. To do so, we repeated the estimation of the linear spline 
function. To do so, we repeated the estimation of the linear spline function 

1 0 using subsets of the measured data. We then counted for how many genes the 

estimated knot positions changed if a subset of the data was used instead of the 
complete set of data. The average and standard deviation of this number for 
four, three, and two data points at each time point is shown in Table 8. 

Even if only two data points (at the one hour time point) are removed, 

1 5 and four data points are used at each time point in 1 5% of the cases the 

estimated knot positions changed. Thus, in certain embodiments, four or more 
data points are needed for each time point to reliably deduce information from 
gene expression measurements. 

20 Discussion 

We have described a strategy based on maximum likelihood methods to 
analyze a set of time-ordered measurements. By applying Student's Mest to the 
measured gene expression data, we first established which of the measured 
genes are significantly affected by the experimental manipulation. The 
25 expression responses of those genes were then described by fitting a linear 

spline function. The number of knots to be used for the linear spline function 
was determined using Kike's Information Criterion (AIC). 

Using linear spline functions permits more flexibility in describing the 
measured gene expression than using a nominal classification. Also, to set up a 
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gene regulatory network, it can be desirable that the gene response as 
determined from gene expression measurements is available in a numerical 
form. Finally, positions of the knots specify those time points at which the 
expression of a gene changes markedly, which can be desirable in identifying its 
5 biological function. Classification of gene expression responses based on the 

position of the knots can be refined by creating subcategories that take the 
magnitude of the linear spline function at the knots into account. For instance, 
for linear spline functions with three knots, we may consider creating six 
subcategories in which changes in the gene expression level are described by 
10 (flat, increasing), (flat decreasing), (increasing, flat), (decreasing, flat), 

(increasing, decreasing), or (decreasing, increasing). 

Applying the technique of linear spline functions to measured gene 
expression data, we can identify the temporal expression response pattern of 
genes that were significantly affected by the experimental manipulations. The 

1 5 response of 42 of those genes was not noticed in earlier fold-change analyses of 

expression data. Furthermore, it was shown that the expression response levels 
found in a fold-change analysis were not found to be significant by Student's t- 
test for 33 out of 164 genes some genes. In some embodiments, gene 
expression data may be nosy and are plagued by outliers. Whereas Student's t- 

20 test and maximum likelihood methods described here take the statistical 

significance of noisy data into account, the issue of outliers needs to be 
addressed separately. As a simple procedure to remove outliers, we calculated 
the mean and standard deviation of the data for each point in time, and removed 
those data that deviate more than two standard deviations or so from the mean. 

25 Finally, the number of expression measurements needed at each time 

point to reliably fit a linear spline function was determined by removing some 
data points and fitting a linear spline function anew. It was found that if four 
data points per time point were used, in about 15% of the cases the knot 
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positions will not be estimated reliably. It is therefore advisable to make more 
than four measurements per time point. 

VI. Use of Gene Networks for Identifying and Validating Drug Targets 

5 We describe new methods for identifying and validating drug targets by 

using gene networks, which are estimated from cDNA microarray gene 
expression data. We created novel gene disruption and drug response 
microarray gene expression data libraries for the purpose of drug target 
elucidation. We used two types of microarray gene expression data for 

10 estimating gene networks and then identifying drug targets. Estimated gene 

networks can be useful in understanding drug response data and this 
information is unattainable from clustering analysis methods, which are the 
standard for gene expression analysis. In the construction of gene networks in 
certain embodiments, we used both Boolean and Bayesian network models for 

1 5 different steps in analysis and capitalize on their relative strengths. We used an 

actual example from analysis of the Saccharomyces cerevisiae gene expression 
and drug response data to validate our strategy for the application of gene 
network information to drug discovery. 

20 1 . Introduction 

Cluster methods have become a standard tool for analyzing microarray 
gene expression data. However, they cannot offer information sufficient for 
identifying drug targets in either a theoretical or practical sense. We provide 
methods to determine how estimated gene networks can be used for identifying 
25 and validating target genes for the understanding and development of new 

therapeutics. Gene regulation pathway information is desirable for our purpose, 
and we used both Boolean and Bayesian network modeling methods for 
inferring gene networks from gene expression profiles. The procedure for 
identifying drug targets can be separated into two parts. At first, identify the 
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drug-affected genes. Second, we search for the "target" genes, which are 
usually upstream of the drug-affected genes in the gene network. A Boolean 
network model is useful for identifying the drug affected genes by using "virtual 
gene" gene technique, described herein and a Bayesian network model can be 
5 used for exploring the draggable gene targets related to the elucidated affected 

genes. We applied our methods to novel Saccharomyces cerevisiae gene 
expression data, comprised of expression experiments from 120 gene 
disruptions and several dose and time responses to a drug. 

10 2. Gene Networks for Identifying Drug Targets 

A. Clustering Methods 

Clustering methods such as the hierarchical clustering and the self- 
organizing maps are commonly used as a standard tool for gene expression data 
analysis in the field of Bioinformatics. Eisen focuses on the hierarchical 

15 clustering and provides software, Cluster/TreeVew, for clustering analysis of 

gene expression data. De Hoon et al improved this software especially in the k- 
means clustering algorithm. 

Clustering methods only provide the information on gene groups via the 
similarity of the expression patterns. However, it can be desirable to have 

20 additional hierarchical pathway information, not only cluster information to 

detect drug targets that are affected by an agent. We show the limitations of 
clustering techniques for drug targeting purposes through real data analysis. 

We use two new methods for estimating gene networks from gene 
expression data. In this subsection, we give the brief introductions of both 

25 methods. For detailed discussions of the algorithms, please refer the papers in 

the references section. 

B. Boolean Network 

To estimate a boolean network model, we can discretize the gene 
30 expression values into two levels, 0 (not-expressed) and 1 (expressed). Suppose 
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that uu . . • u k are input nodes of a node v. The state of v is determined by a 
Boolean functions f u ($(uj), . . . , \p(u*)), where is the state or the 

expression pattern of uj. If we have time series gene expression data, the state 
depends on the time t and the state of the node at time t depends on the states of 
inputs at time t - 1. On the other hand, suppose that we have gene expression 
data obtained by gene disruptions. Akutsu et al. proposed the theory and 
methodology for estimating a Boolean network model without time delay. 
Maki et al. provides a system, named AIGNET, for estimating a gene network 
based on the Boolean network and an S-system. We use the AIGNET system 
for estimating Boolean network models. 

Advantages of the use of the Boolean network model includes: 
a) This model is simple and can be easily understood. A Boolean network 
model can detect the parent-child relations correctly, when the data has 
sufficient accuracy and information and b) the estimated Boolean network 
model can be directly applied to the Genome Object Net, a software tool for 
biopathway simulation. A disadvantage of Boolean methods is that data must 
be discretized into two levels, and the quantization loses information. 
Moreover, the threshold for discretizing is a parameter and should be chosen by 
a suitable criterion. 

C. Bayesian Networks 

A Bayesian network is a graphic representation of complex relationships 
of a large number of random variables. We consider the directed acyclic graph 
with Markov relations of nodes in the context of Bayesian network. We can 
thus describe complex phenomena through conditional probabilities instead of 
the joint probability of the random variables. Further description can be found 
in Example 5 herein. 

Friedman et al. proposed an approach for estimating a gene network 
from gene expression profiles. They discretized the expression values into three 
values and used the multinomial distributions as the conditional distributions of 
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the Bayesian network. However, this did not solve the problem of choosing 
threshold values for discretizing. 

We developed a nonpar ametric regression model that offers a solution 
that does not require quantization together with new criterion, named BNRC, 
5 for selecting an optimal graph. The BNRC is defined as the approximation of 

the posterior probability of the graph by using the Laplace approximation for 
integrals. We applied the proposed method to Saccharomyces cerevisiae gene 
expression data and estimated a gene network. The advantages of this method 
are as follows: a) one can analyze the micro arry data as a continuous data set, b) 

1 0 this model can detect not only linear structures but also nonlinear dependencies 

between genes and c) the proposed criterion can optimize the parameters in the 
model and the structure of the network automatically. 

A Bayesian network has some advantageous based on the mathematics 
of inference, and we use the method for constructing Bayesian networks. We 

1 5 can construct cyclic regulations and multilevel directional models of regulatory 

effects from data created from logical joins of expression data from disruptants 
and drug response experiments by combining the Bayesian and Boolean 
networks in analysis. Hence, the Boolean and Bayesian networks used together 
can cover the shortcomings one another and we can obtain more reliable 

20 information. 

3. Applications to Microarray Data 

We created two libraries of microarray data from Saccharomyces 
cerevisiae gene expression profiles. One was obtained by disrupting 120 genes, 

25 and the other was comprised of the response to an oral antifungal agent, (four 

concentrations and five time points). We selected 735 genes from the yeast 
genome for identifying drug targets. In YPD, 314 genes were defined as 
"Transcription factors", and 98 of these have already been studied for their 
control mechanism. The expression profile data for 735 genes chosen for 

30 analysis included the genes controlled by these 98 "Transcription factors" from 



-50- 



WO 03/027262 



PCT/US02/31093 



5871 gene species measured in addition to nuclear receptor genes which have a 
role in gene expression regulation and are popular drug targets. We constructed 
network models from the data set of 735 genes over 120 gene disruption 
conditions. The details of the disruption data are also described in Imoto et al. 

As for drug response microarray gene expression data, we incubated 
yeast cultures in dosages of 10, 25, 50, 100 mg of an antifungal medication in 
culture and took aliquots of the culture at 5 time points (0, 15, 30, 45 and 60 
minutes) after addition of the agent. Here time 0 means the start point of this 
observation and just after exposure to the drug. We then extracted the total 
RNA from these experiments, labeled the RNA with cy5, hybridized them with 
cy3 labeled RNA from non-treated cells and applied them to full genome cDNA 
microarrays thus creating a data set of 20 microarrays for drug response data. 
In this paper, we use these 140 microarrays to elucidate drug targets using gene 
networks. 

4. Results 

A. Clustering Analysis 

In the identification of the drug targets, a popular but problematic prior 
art strategy has been to use clustering analysis often even with a library of base 
perturbation control data to compare to (()). We have two types of microarray 
data, gene disruption and drug response, allowing us to compare drug response 
patterns to gene expression patterns caused by disruption. In the clustering 
analysis, if there is significant and strong similarity between the expression 
patterns of a single disruptant or group of disruptants and a given drug response 
microarray, we may conclude that an agent probably plays the same role as the 
disrupted gene(s). Moreover, if this disrupted gene has know functional role, 
we may obtain more information about the response to a drug. 

Unfortunately, as if often the case with such experiments we cannot gain 
such a straightforward result from clustering our data. Figure 1 shows the 
image representation of the correlation matrix of our microarray data. We 
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combine the two types of data and make the matrix Z = (X: Y) 9 where X and Y 
are the drug response and the gene disruption microarray data matrix, 
respectively. Here, each column denotes an expression pattern obtained by one 
microarray, and each column is standardized with a mean of 0 and variance of 
5 1 . Therefore, Figure 1 of Example 5 depicts the information of the correlation 

matrix R — Z T z /p, where p is the number of genes. To illustrate our methods, 
we focus on 735 genes for estimating gene networks and identifying drug 
targets. 

In Figure 1 of Example 5, the light and dark colors denote the positive 
1 0 and negative high correlations, respectively. Drug response microarrays have 

high correlation amongst each other and a low correlation with any of the gene 
disruption microarrays. Under such a situation, it may be difficult to identify 
interactions between gene disruptions and drug responses from the clustering 
analysis that would be meaningful in the context of drug response. We further 
1 5 implemented hierarchical clustering of the drug response microarrays, but this 

produced one cluster and we could not extract any more information on the 
actual drug targets from this analysis. This result was essentially unchanged 
when we use other distance measurements or clustering techniques. Hence, it is 
difficult to obtain information for identifying meaningful drug targets from the 
20 clustering methods. 

B. Boolean Network Analysis 

To overcome shortcomings of prior art clustering methods, we estimated 
gene network by using the micro any data, Z, which was created by combining 
gene disruption and drug response microarrays. We consider the conditions of 
25 the drug responses data as "virtual genes", e.g., the condition lOOmg/ml and 

30min is given an assignment as the gene YEXP 1 00mg3 Omin. By using a 
Boolean network model, we found child genes of these virtual genes, with the 
drug affecting these child genes in progeny generational order. We focus on 
genes which five or more virtual genes as the parent genes, as the putative drug- 
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affected genes, that is, genes which are under direct influence of the virtual 
genes (drug affect). However, a gene which has only one virtual gene as its 
parent gene may be the primary drug-affected gene, depending on the mode of 
action for a given agent. The virtual gene technique highlights the use of 
Boolean network models compared with the Bayesian network model in the 
initial screening for genes under drug-induced expression influence. 

In addition, fold-change analysis can provide similar information to the 
proposed virtual gene technique. We identified the affected genes under certain 
experimental conditions by fold-change analysis. However, our virtual gene 
technique can improve the result of the fold-change analysis. Suppose that we 
find gene A and B are affected by the drug from fold-change analysis. The fold- 
change analysis cannot take into account the baseline interactions between 
geneA and geneB. That is, if there is a regulation pathway between geneA and 
geneB that geneA — » geneB, the geneB may not be affected by the drug directly. 
Rather, an effect of the drug on geneA may result in an indirect effect on geneB. 
The virtual gene technique can take into account such interaction by using the 
information of the gene disruption data and thus reduce the search set to more 
probable target genes given available interaction data. 

There is not guarantee that genes that are most affected by an agent are 
the genes that were "drugged" by the agent, nor is there any guarantee that the 
drugged target represents the most biologically available and advantageous 
molecular target for intervention with new drugs. Thus, even after identifying 
probable molecular modes of action, it is desirable to find the most draggable 
target genes upstream of the drug-affected genes in a regulatory network and to 
then screen low molecular weight compounds for drug activity on those targets. 
In the estimated Boolean network, virtual genes can be placed on the top of the 
network. Therefore, it can be difficult or sometimes impossible to find 
upstream information for drug-affected genes in this estimated Boolean 
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network. In such circumstances, we use a Bayesian network model for 
exploring the upstream region of the drug-affected genes in an effective manner. 

C. Bayesian Network Analysis 

We found that a gene network can be estimated by a Bayesian network 
5 and nonparametric regression method together with BNRC optimization 

strategy. We use the Saccharomyces cerevisiae microarray gene expression 
data as described herein obtained by disrupting 120 genes. From Boolean 
network analysis, we found the candidate set of the drug-affected genes 
effectively. Draggable genes are drug targets related to these drug-affected 

1 0 genes, which we want to identify for the development of novel leads. We 

explored the druggable genes on the upstream region of the drug-affected genes 
in the estimated gene network by a Bayesian network method. Using a 
Bayesian model of network regulatory data available to us from our knockout 
expression library, we searched upstream of drug affected target genes, which 

15 have a known regulatory control relationship over drug affected target 

expression. For example, we focus on nuclear receptor genes as the druggable 
genes because a) nuclear receptor proteins are known to be useful drug targets 
and together represent over 20% of the targets for medications presently in the 
market, b) nuclear receptors are involved in the transcription regulatory affects 

20 that are directly measured in cDNA microarry experiments. 

Figure 3 of Example 5 shows a partial network, which includes the drug- 
affected genes (Bottom), the druggable genes (Top) and the intermediary genes 
(Middle). Of course, we can find more pathways from the druggable genes to 
the drug-affected genes if we admit more intermediary genes. Due to the use of 
25 the Bayesian network model, we can find the intensities of the edges and can 

select more reliable pathway. This is an advantage of Bayesian network models 
in searching for suitable druggable targets. In Figure 3 of Example 5, druggable 
genes in the circle connect directly to the drug-affected genes and the other 
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draggable genes have one intermediary gene per one druggable gene. From 
Figure 3, we identified druggable genes for each drug-affected gene, e.g., we 
found the druggable genes for MAL33 and CDC6 shown in Table 1 of Example 
5. 

5- Discussion 

We describe new strategies for identifying and validating drug targets 
using computational models of gene networks. Boolean and Bayesian networks 
can be useful for estimating gene networks from microarray gene expression 
data. Using both methods we can obtain more reliable information than by 
using either method above. A Boolean network is suitable for identifying drug- 
affected genes by using the virtual gene technique set forth herein. Bayesian 
network models can provide information for upstream regions of the drug- 
affected genes and we can thus attain a set of candidate druggable genes. Our 
novel strategies are established based on the sophisticated use of a combination 
of two network methods. The strength of each network method can be clearly 
seen in this strategy and the integrated method can provide a methodological 
foundation for the practical application of bioinformatics techniques for gene 
network inference in the identification and validation of drug targets. 

VII. Drug Target Discovery and Validation with Gene Regulatory 
Networks 

Gene regulatory networks developed with full genome expression 
libraries from gene perturbation variant cell lines can be used to quickly and 
efficiently to identify the molecular mechanism of action of drugs or lead 
compound molecules. We developed an extensive yeast gene expression library 
consisting of full-genome cDNA array data for over 500 yeast strains each with 
a single gene disruption. Using this data, combined with dose and time course 
expression experiments with the oral antifungal agent Griseofulvin, we used 
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Boolean and Bayesian network discovery techniques to determine the genes 
whose expression was most profoundly affected by this drug. Griseofulvin, 
whose exact molecular target was previously unknown, interferes with mitotic 
spindle formation in yeast. Our system was used to directly discover CIK1 as 
5 the primarily affected target gene in the presence of Griseofulvin. Deletion of 

CIK1 9 the primary affected target determined by gene network discovery 
produces similar morphological effects on mitotic spindle formation to those of 
the drug. Using the base hierarchical data from the expression library and 
nonparametric Bayesian network modeling we were able to identify alternative 
1 0 ligand dependant transcription factors and other proteins upstream of OKI that 

can serve as potential alternative molecular targets to induce the same molecular 
response as Griseofulvin. This process for network based drug discovery can 
significantly decrease the time and resources necessary to make rational drug 
targeting decisions. 

15 

1. Introduction 

Rational drug design methodologies have previously been concentrated 
on optimizing small molecules against a predetermined molecular target. The 
randomized lead to target to phenotype screening for target selection that is 

20 currently the prevailing paradigm in drug discovery has failed to offer a more 

efficient and accurate target selection process even with the advent of wide 
scale availability of genomic information and high throughput screening 
processes (lj2,3) . The availability of genomic sequences, full genome 
microarrays and recent advances in gene network inference computational 

25 techniques allows for a new rational paradigm for drug target selection that 

takes into account global networked regulatory interactions among molecules in 
the genome. Accurate models of gene regulatory network data can be produced 
from disruptant based expression data (4,5) by using various computational 
inference techniques (6,7) . Here we show how this gene regulatory information 

30 can be used to quickly determine the molecular networks and gene targets 
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affected by a given compound. The same information allows for the 
identification and selection of alternative draggable molecular targets upstream 
or downstream of a drug targeted molecule in the gene expression regulatory 
cascade. 

We have developed a gene regulatory network-driven iterative drug 
target discovery process. In this methodology, first large numbers of gene 
expression experiments are performed on single gene disruptant cell lines. This 
information is used to create computationally inferred maps of hierarchical gene 
expression control. The hierarchical regulatory information is used as a basis 
for evaluation of drug response experiments and for generation of hypotheses of 
molecular action mechanisms. Information from the literature and further 
biological experimentation on the elucidated regulatory subnetworks is used to 
understand and validate results before selecting a candidate molecule for drug 
targeting. 

Most effective drugs in clinical use including aspirin and other popular 
medications have not been rationally designed to interact with a specific 
molecular target. Thus, even when the desired clinical effect or phenotype is 
achieved with these drugs, the underlying molecular mechanisms of action and 
thus the mechanisms of the drag's side effects remain unknown. Full genome 
gene expression experiments have been shown to be useful in determining 
alternative genes and pathways affected by a drag (8j9>10 \ but determination of 
the primary molecular target for many drugs which affect hundreds of genes 
with standard gene expression analysis methods such as clustering is impractical 
without apriori information on potential targets for the drug. Here we 
demonstrate the use of hierarchical gene expression regulation networks from 
full genome expression libraries and gene network modeling techniques 
together with drag response expression experiments to determine the previously 
underlying molecular targets for the popular generic antifungal agent, 
Griseofulvin. 
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Griseofulvin is a widely prescribed oral antifungal agent that is 
indicated primarily for severe fungal infections of the hair and nails. While 
Griseofulvin' s molecular action is unknown, the drug disrupts mitotic spindle 
structure in fungi to lead to metaphase arrest. 

5 

2. Methods 

Microarray experiments 

The yeast strain used in this study is BY4741. To monitor the gene 
expression profile, cells were pre-grown at 30 °C in YPD (2 % polypeptone, 1 
10 % yeast extract and 2 % glucose) to mid-exponential phase and were exposed 

with Griseofulvin by adding it to the medium to the concentrations at 0, 10, 25, 
50, and 100 mg/ml. The exposed cells were harvested at 0, 15, 30, 45 and 60 
min after addition by Griseofulvin and used for RNA extraction. Total RNAs 
were extracted by the hot-phenol method. 

15 

Boolean Network Inference Algorithms 

In addition to the Boolean methods for gene disruption experiments 
reported elsewhere (n) , we created a gene expression matrix E from a set of the 
drug treatment experiments combined with disraptant expression data. In the 
20 case of drug generated perturbation, we created a "virtual" gene to represent the 

drug affect analogous to a gene disruption. We then were able to use standard 
disruption matrix algorithms designed for disruption based data. 

Bayesian Network Inference Algorithms 

25 We performed non-parametric regulation and Bayesian networks to 

define regulatory subnetworks upstream of OKI using algorithms and methods 
reported elsewhere herein. 

Data Normalization 



-58- 



WO 03/027262 



PCT/US02/31093 



We measured the quantities of 5871 mRNA species from 20 drag treated 
strains by cDNA microarray assay. A difference in fluorescent strength 
between Cy3, Cy5 causes bias of the expression quantity ratio. We normalized 
the expression quantity ratios of each expression profile. The ratio bias had a 
fixed trend in each spotted block, thus we calculated a linear regression to 
normalize the mean value ratio of each block to 1.0. The logarithm value of the 
ratio was used to indicate the standard expression level, therefore we found the 
logarithm value of ratio and calculated the average and standard deviation of 
these log values (see tablel). The Standard Deviation (SD) of expression levels 
of all spotted genes from the UME6 (YDR207Q disruptant expression array for 
which UME6 is defined as a "Global Regulator" in YPD (33) disruptant was 
0.4931, therefore we recognized that there are an unacceptable number of errors 
in array data whose overall SD was larger than 0.5 and eliminated such 
experiments from analysis. 

Selection of Genes for Modeling 

In YPD, 314 genes were defined as "transcription factors", and 98 of 
these have previously been studied for control mechanism. The expression 
profile data of 552 genes including the genes that are controlled by these 98 
"transcription factors" were selected from 5871 profiles. Thus we constructed 
the gene regulatory network from the expression profile data set based on the 
values of these 552 genes in 120 gene disruption experiments and 20 drug 
treatment experiments. 

3. Results 

We incubated yeast cultures in dosages of 10, 50 and 100 mg in 10ml 
of DMF and took aliquots of the culture at 5 time points (0, 15, 30 , 45 and 60 
minutes) after addition of Griseofulvin. We then extracted the total RNA from 
these experiments, labeled the RNA with cy5, hybridized them with cy3 labeled 
RNA from non-treated cells and applied them to full genome cDNA 
microarrays. 183 genes were downregulated by over a 2 sigma threshold 
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among 552 genes which differed in expression between drug treated and normal 
yeast (Figure 7a). Standard hierarchical clustering methodologies (Figure 7b) 
applied to the combined expression libraries from drug response and gene 
disruption experiments, clustered genes into two major groups; first, genes 
affected by Griseofulvin and second, genes affected by disruption. Within the 
Griseofulvin clusters, genes were further grouped by dosage or time course. 
However, clustering did not discover any gene expression patterns that 
significantly indicated correlated regulation by a given gene and the drug 
Griseofulvin. This result would be expected except in cases where the 
antifungal agent affects the expression of only one discrete gene and minimal 
(Figure 7b). 

However, the use of gene regulatory network models, combined with 
the drug perturbation data allows for a hierarchical gene regulatory view of the 
drug's interaction with genes in the transcriptome. To generate this gene 
network drug perturbation data, we first created a full genome expression 
library of 542 single-gene disruption mutants. The 120 array data selected from 
the library was logically joined with the array matrices generated from the 
Griseofulvin experiments. A Boolean methodology designed for gene network 
elucidation 0 1,1 2) was then applied to the joint expression matrices for each time 
course and hierarchical regulatory maps for each experiment were generated for 
each dose and time point. We produced joint Boolean regulatory subnetworks 
for each dosage and time point experiment. The Boolean algorithm was 
selected for its suitability for handling joint matrices, its ability to handle looped 
regulatory processes and the ease creation of hierarchical directed graphs with 
several orders of regulatory separation. 

From this data we were able to identify the first order drug affects as opposed to 
secondary cascades initiated by those initial perturbation regulatory events. 

By evaluating the Boolean data from each time and dose differential 
experiments we were able to identify 8 genes that were consistently and 
significantly suppressed as first effects at each time and drug concentration. Of 
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these genes, OKI exhibited the strongest suppression effects across the 
experiments. CIKJ codes for a protein described in the yeast proteome database 
(YPD) as a coiled-coil protein of the spindle pole body involved in spindle 
formation and the congression (nuclear migration) step of karyogamy (13) . Since 
the action of Griseofulvin is known to affect mitotic spindle formation, which is 
in agreement with the function of CIKJ (U \ we performed a pathological 
examination of a yeast strain with a disrupted CIKJ gene and yeast affected by 
Griseofulvin. While neither the treatment with Griseofulvin at normal 
physiological dosage nor the disruption of CIKJ are lethal, both cultures show 
similar morphological differences and growth characteristics Furthermore, 
microscopic examination of the mitotic spindle structure in Griseofulvin treated 
yeast and CIKJ deleted yeast show very similar changes of the spindle body and 
surrounding organizational structures (15) . 

The methodologies described here clearly demonstrate the utility of a 
combined expression array and computational approach using gene network 
techniques to rapidly ascertain and validate the molecular mechanisms of action 
of a given compound on a cell. Use of such techniques will help to rationalize 
the target selection process of pharmaceutical development in the post-genomic 
era and could contribute to efficiency of discovery and a reduction in 
development risk for the pharmaceutical industry. The same techniques can 
further be applied to other biological discovery and agrochemical targeting. 
Our laboratories are currently replicating this discovery model in human and 
other biological systems. Additional descriptions can be found in United States 
Provisional Patent Application serial no. 60/395,756, filed July 12, 2002 
incorporated herein fully by reference. 
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VII. Systems for Discovering Network Relationships and Uses Thereof 

Li other embodiments, methods for elucidating genetic networks are 
15 provided that include Example 6 described herein. Accordingly, a desirable 

system can include those in which: 

(1) one can collect experimental data for enabling the network to be 
elucidated when the genome structure is elucidated; 

(2) data of all the related genes can be measured; 

20 (3) many tools for enabling many experiments to be used, such as gene 

chips; 

(4) an output is measured after applying a disturbance to obtain many 
standardized data; and 

(5) analysis of the genetic relationships can be determined. 

25 

An example of such a system is presented in Figure 1 of Example 6. Figure 
2 of Example 6 depicts schematically, a method for obtaining microarray data 
on expressed genes from an organism. Figure 3 of Example 6 depicts 
schematically a method for assessing and quantifying the expression of a gene 
30 by comparing the amounts of gene products (e.,g., RNA) from mutated cells 
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(dismptants) in which a particular gene is disrupted with the amounts of gene 
products from normal (wild-type) cells. 

Analysis of a large scale network can be accomplished, in some 
embodiments, using the guideline provided as Figure 4 of Example 6. Time 
course studies can be carried out and the expression of each gene under study 
can be evaluated. A Boolean network model can be provided and a dynamic 
model of the network can be made. Positive and negative interactions can be 
mapped, thereby producing a gene network. A multilevel diagraph approach 
can be taken to relate effects of disrupting (or altering expression) each gene 
with respect to other genes under study. 
Incorporation by Reference 

All patents, patent applications and references cited in this application 
are incorporated herein fully by reference. 

The foregoing description of embodiments of the present invention has 
been provided for the purposes of illustration and description. It is not intended 
to be exhaustive or to limit the invention to the precise forms disclosed. Many 
modifications and variations will be apparent to the practitioner skilled in the 
art. The embodiments were chosen and described in order to best explain the 
principles of the invention and its practical application, thereby enabling others 
skilled in the art to understand the invention and the various embodiments and 
with various modifications that are suited to the particular use contemplated. It 
is intended that the scope of the invention be defined by the following claims 
and their equivalence. 

EXAMPLES 

The following Examples are provided to illustrate embodiments of this 
invention. Other specific applications of the teachings in this patent application 
can be used without departing from the spirit of this invention. Other 
modifications of the methods can be used, and are considered to be within the 
scope of this invention. 
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Example 1 



ESTIMATION OF GENETIC NETWORKS AND FUNCTIONAL 
STRUCTURES BETWEEN GENES BY USING BAYESIAN 
NETWORKS AND NONPARAMETRIC REGRESSION 



We propose a new method for constructing genetic network from gene expression 
data by using Baysian networks. We use nonparametric regression for capturing 
nonlinear relationships between genes and derive a new criterion for choosing the 
network in general situations. In a theoretical sense, our proposed theory and 
methodology include previous methods based on Bayes approach. We applied pro- 
posed method to the S. cerevisiae cell cycle data 13 ' 20 and showed the effectiveness 
of our method compared with previous methods, 

1 Introduction 

The improvement of the genetic engineering provides us enormous amount 
of valuable data such as microarray gene expression data. The analysis of the 
relationship among genes has drawn remarkable attention in the field of molec- 
ular biology and Bioinformatics. However, due to the cause of dimensionality 
and complexity of the data, it will be no easy task to find structures, which 
are buried in noise. To extract the effective information from biological data, 
thus, theory and methodology are expected to be developed from a statistical 
point of view. Our purpose is to establish a new method for understanding 
the relationships among genes clearer. 

In the analysis of the microarray gene expression data, Bayesian networks 
are adopted for constructing genetic networks 3 » 4 > 5 > 12 from a graph- theoretic 
approach. Friedman and Goldszmidt (1998) 12 proposed an interesting method 
for constructing genetic links by using Bayesian networks. They discretized 
the expression value and considered to fit the models based on multinomial 
distributions; However, a problem still remains to be done in choosing the 
threshold value for disctritizing by not only the experiments. The threshold 
value assuredly gives essential changes of the results and unsuitable threshold 
value leads to wrong results. On the other hand, recently, Friedman et ah 
(2000) 13 pointed out that discretizing was probably loosing the information. 
To use the expression data as continuous values, thus, they considered the use 
of Gaussian models based on linear regression. However this model can only 
detect linear dependencies and we cannot obtain sufficient results. 

In this paper we propose a new method for constructing genetic networks 
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by using Bayesian networks. To capture not only linear dependencies but also 
nonlinear structures between genes, we use nonpar ametric regression models 
with Gaussian noise 1 1>15 » 21 > 22 . Nonpar ametric regression has been developed in 
order to explore the complex nonlinear form of the expected responses without 
the knowledge about the functional relationship in advance. Due to the new 
structure of the Bayesian networks, a suitable criterion is needed for evaluating 
models. We derive a new criterion from Bayesian statistics. By using proposed 
method, we will overcome the defects of previous methods and attain more 
effective information. In addition, our method includes the previous methods 
as spacial cases. We demonstrate our proposed method through the analysis 
of the S. cerevisiae cell cycle data. 

2 Bayesian Network and Nonpar ametric Regression 

Let X = (XijX2y..- y Xp) T be a p-dimensional random vector and let G be 
a. directed acyclic graph. Under the Bayesian network framework, we look 
upon a gene as a random variable and decompose the joint probability into the 
product of conditional probabilities, that is 

p(x j9 x 2 ,-. 9 x r ) = p(x 1 \r 1 )P(x 2 \r 2 ) x ... x p(x p \p p ), (i) 

where JPj = (P^ , P%p , P$p) T is a g^-dimensional vector of parent variables 
of Xj in the graph G. - 

Suppose that we have n observations Xj, i n of the random vector X 
and the observations of Pj are denoted by where p { j is a q$- 

dimensional vector with Ar-th element p$, for k = For example, 

let X n be an n x p matrix, where X n = (xj , x„) T = (x(j), ...,X( p )) = 
(^ii)i=i,... > nu=i^..,P5 = (*ii>— y *i P ) T y = («ii,—,afni) T and a>T is the 
transpose of the vector x f . If X x have a parent vector i^i = (X 2 ,X 3 ) T , we 
have = (xi2,xi 3 ) T , = (x**,*,*) 7 - 

It is immediately found that the equation is held when we replace the 
probability measure P in (1) by densities 

f(x ily x i2 \„^x ip ) = /i(xa|Pa)/2(tta(Pa) x x / F (^ P !p t> ). 

Then all we need to do is to consider how construct the conditional densities 
fj(*ij\Pij) U = 1, -->*>)- 

In this paper, we use nonpar ametric regression models for capturing the 
relationship between x iS and p i} = (p£p, .»,2>i2]) T in the form 

ar ;i = mi (p£>) + m 2 (p < £ } ) + ~--hm qj (p£>) + e ii? i = I, n; i = 1> 
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where (k = I,..., qj) are smooth functions from R to R and eij (i = 1, ...,n) 
depend independently and normally on mean 0 and variance cr?. For the 
function m^, it is assumed that 

M iJt 

m*(pS?)== £7^Mi(p!i } ), « = l,~,n; k = l,~,», . (2) 

77*= 1 

where {6^ ,...,6^. fcJfe } is a prescribed set of basis functions (such as Fourier 
series, polynomial bases, regression spline bases, JE?-spline bases, wavelet bases 
and so on), the coefficients ~»7jj#. fc jfe are unknown parameters and M jk is 
the number of basis functions. 

Then a nonparametric regression model can be written as a probability 
density function in the form 



SJ —J ■ < 3 > 



where -yj- = (-y£ , ^nff gj ) T is a parameter vector with 7 ifc = (7^, : »,Tj^ fc fc) T - 
If a variable ~Xj has not parent variables, we consider the model based on the 
normal distributions with mean fij and variance <r|. 

- Finally we have a Bayesian network model based on the nonparametric 
regression model with Gaussian noise 

. p 

where = (&T> — » &p) T }S a parameter vector included in the graph G and Oj 
is a parameter vector in the model f jy i.e., Oj = {jfyvj) 7 or 0j = 0ii,<r|) T . 

3 Proposed criterion for choosing graph 

Let tt(#<?|A) be the prior distribution on the unknown parameter vector 0 G 
with hyper parameter vector A and let log7r(0G?|A) = O(n). The marginal 
probability of the data X n is obtained by integrating over the parameter space, 
and we choose a graph G with the largest posterior probability 

n G f f[f{^e G )^{e G \X)de Gy (4) 
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where tt g is a prior probability of G. Friedman and Goldszrnidt (1998) 12 con- 
sidered the multinomial distribution as the Bayesian network model /(a^; 0 G ) y 
and also supposed the Dirichlet prior on the parameter 0 G . In this case, the 
Dirichlet prior is the conjugate prior and the posterior distribution belongs to 
the same class of distribution. Then a closed form solution of the integration 
in (4) is obtained, and they called it BDe score for choosing graph 6 ' J 6 . Recall 
that the BDe score is confined to the multinomial model, and we propose a 
criterion for choosing graph in more general and various situations. 

The essential problem of constructing criteria based on (4) is how compute 
the integration. Some methods can be considered for computing the integra- 
tion such as Markov chain Mote Carlo, we use the Laplace approximation for 
integrals 7,17 ' 23 . The Laplace approximation to the marginal probability of X n 
is 

fflf{^O G M0a\X)de a = fexp{nl*(0 G \X n )}dO G 

\J\(VG)Y f 

where r is the dimension of 0 Gy 

Ix{0g\X„) = -^}og/(x j; eG) + -log7r(0 G |A), 
n n - 

Jx(Og) = ^a 2 {h(o G \x n )}/d9Gd&% 



and 0 G is the mode of lx(9 G \X n ). Then we have a criterion, BNRC, for 
selecting graph 

BNRC(G) = -21ogj*oy*II^ 

= -2Iog7r G - rlog(27r/n) + \og\J x {9 G )\ - 2nl x {0 G \X ri y (5) 

The optimal graph is chosen such that the criterion BNRC (5) is minimal. 

This criterion is derived under Iog7r(0 a |A) = O(n). If]og7r(#a|A) = O(l), 
the mode 0 G is equivalent to the maximum likelihood estimate, MLE, and the 
criterion is resulted in Bayesian information criterion, known as BIG 19 by 
removing the higher order terms Q(n~ j ) (j > 0). Konishi (2000) 18 provided 
a general framework for constructing' model selection criteria based on the 
KuUback-Leibler information and Bayes approach. 
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BNRC 



It is assumed that the prior density ir(6 G \X) is decomposed into the prod- 
uct of the prior densities on 9 j, 7r G (0 G |A) = TTi (0, |Aj) x • - - x w P (0 P \\ P ). Hence 
lx(0G\X n ) and log|J.*(#G)| in (5) are, respectively, 

where 

(^-|X n ) = - X) log jpy ; 0,-) + i log ^ |Aj). (6) 

Thus the BNRC (5) can be obtained by the local scores of graph as follows: 
We define the local BNRC for the /-th variable X, by 

3 i - -2 log | ?Jtj ^ n ^(^lA^d^j , (7) 

where tti^ is a prior probability of the local structure associated with Xj- We 
also apply Laplace's method to the BNRC,- and the BNRC is obtaind by 

p 

BNRC = ~21og7rGr + ^{BNRCy + 2!ogw ri }. 

Notice that the final graph is selected as a minim izer of the BNRC and not 
have to be minimize each local score BNRC/, because the graph is constructed 
as acyclic. 

4 Estimating graph and related structures by using BNRC 

In this Section we express our method in more concrete terms. Essential points 
of our proposed method are the use of the nonparametric regression and the 
new criterion for choosing graph from Bayesian statistics. 

As. for nonparametric regres- 
sion in Section 2, we* use the J3- 
splines 8 as the basis functions 
in (2). Figure 1 is an exam- 
ple of ^-splines of degree 3 with 
equidistance knots £i, ...,£io- We 

place the knots dividing the domain t \ ' \/ 4 x £ x 5 * 4 25 ^ " " £ ^ 

[min,*(p^' ) ),maXi(p^' ) )] into Mjk ~ 3 Figrure 1: Example of 6 B-spKnes of degree 3. 

equidistance interval 10 and set M jk *» ' — * 10 ^ ^^^eT^ ^ " 
J?-splines of degree 3. 
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We assume that the prior distribution on the parameter vector 0j is 

Si 



Each prior distribution ir jk ('y jh \\j k ) is a singular M jk variate normal distribu- 
tion given by 

where A,* is a hyper parameter, Kj* is an Mj k x Af,-* matrix, ^J k ^jknfj k — 

Efc3 fc (7/ ( fc } - 2 7j- } i,fc + -r { i-2,k) 2 ^ d U0*l+ is the product of M Jk - 2 nonzero 
eigenvalues of K jk ! The score BNRCj (7) can be obtained as 

n 9i 

BNRC,- = -^ldgTrjD, -2^1og/(x ii |p ii ;& i )-2^1og7r fc (ir iJb lA ifc ) 



+ log 



•(^Af/jb + l)^^- 1 ), (9) 



Where = (7j\£|) r is a mode of l^{0j\X n ) defined in (6) for fixed A jfc - 
For computational aspect, we approximate the logarithm of the determinant 
of the Hessian matrix in (9) by 

^^{log|J5^^ fc + nd?Xj k Kjk\ — Mj k log(ncr|)} — log(2<rj), 

where B jk is an n x M jk matrix defined by B jfc = {bj k (p$) y ~-,bj k (p^l)) T 
with bjtipg) = (l$(p&>), -.,^ Jfe (p^ )) T . Hence combining (3), (8) and 
(9), BKRC,- is resulted in 

BNRCy = Cj + (n - 2g,- - 2) log a? 

where pj k = a|Ajjt is a hyper parameter, 

C 3 = -2 log tt^ + (n4 Mj. - 2^) log(27r) + n - log 2 
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— 2(Mj. - fc-yiogn - X! lo S 



By using the backfitting algorithm 15 , the modes 7 iJk (k = 1, can be 

obtained when the values of ft* are given. -The backfitting algorithm can be 

expressed as follow: 

Step 1 Initialize: -y jk = 0 7 k~ 

Step 2 Cycle: k = 

TiJfc = (B^B,-* + nP jk K jk r l Bf k {x 01 -X *i*fi*>- 

Step 3 Continue Step 2 until a suitable convergence criterion is satisfied. 
The mode aj is given by 5] = ||x (i) - #ifcTjfcll 2 / n - 

In attention, the modes j jk and a] are depend on the hyper parameters 
j3 jk and we have to choose the optimal values of /3 jk - In the context of our 
method, it is a natural way that the optimal values of /3 jk are chosen as the 
minimizers of BNRCj. 

Recall that the B-splines coefficients vectors y jk are estimated by max- 
imizing (6). The modes of (6) are the same as the penalized likelihood es- 
timates 21 » 24 and we can look upon the hyper parameters X jk or j3 jk as the 
smoothing parameters in penalized likelihood. Hence, the hyper parameters 
play an important role for fitting the curve to the data. 

X t = X% + 2 sin(Xs) - 2X T + ci 
X-z = {1 + exp("4X3)} _1 + e 2 

Xs = Cs t Xfl = Ce, Xg = 

X 4 = Xf/3 + e 4 , X s = X*-X% + es 




Figure 2: Monte , Carlo simulation. 
(Left) true, (Right) estimate. 



{-l+e 7 , X 8 < -0.5 
■X* + ff, -0.5 <X»< 
1 -f «tt, 0.5 < X 9 
Xs = exp(-X 4 - l)/2+ e» 
X10 = cos(Xs) + eio~ 



0.5 



5 Computational experiments 

Before analyzing the real data, we used Monte Carlo simulations to examine 
the properties of our method. The data were generated from artificial graph 
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and structures between variables (Figure 2) and then the results can be sum- 
marized as follows: Proposed criterion BNRC can detect linear and nonlinear 
structure of the data very well. But BNRC score have a tendency toward, over- 
grow of graph. In the analysis of the real data, then, we consider the use of 
Akaike's information criterion known as AIC 1 ' 2 and use both methods. AIC 
was originally introduced as a criterion for evaluating models estimated by 
maximum likelihood method. But the estimates by our method is the same as 
the maximum penalized likelihood estimates and is not MLE. In this case, the 
modified version of AIC 10 is given by 

AIC - -2^1og/(x ii b, i ;7 i ,a|) + 2(£ irS jk + 1), 

where S jk = B^iB^B^-hnp^K^y 1 Bj k . The trace oi S jk shows the degree 
of freedom of the fitted curve and is a great help. That is to say, if trS jk is 
nearly 2, the dependency is looked upon linear. We use both BNRC and AIC 
for decision whether we add up to a parent variable. By using this method, 
the estimated graph and structures are closed to ture model. 

We analyze the S.cerevisiae cell cycle data discussed by Spellman et aL 
(1998) 26 and Friedman et aL (2000) 13 . The data were collected from 800 
genes with 77 experiments. 



CLN2 CDCS 




Figure 3: BNRC scores for CNL2, CDCS and SVSl- 

We set the prior probability 7r G is constant, because we have no reason why 
the large graph is unacceptable and no information about the size of the true 
graph. The nonparametric regressiors are constructed 20 2?-splines- In fact, 
the number of B-splines is also parameter. However, we use somewhat large 
number of B-splines, the hyper parameters control the smoothness of fitted 
curve and we cannot visually find differences among fitted curves corresponding 
with various number of B-splines. 
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The results of the analysis can be summarized as follows: Figure 3 shows 
BNRC scores when we predict CLN2, CDC5 and SVS1 by one gene. The genes 
which give smaller BNRC scores give a better expression the target gene. We 
can observe that which gene is associated with target gene and we find the 
gene set which strongly depend on. In fact, we can construct a brief network 
by using these information. We can look upon the optimal graph as a revised 
version of the brief network by taking account of the effect of interactions. 
We note that if there is a linear dependency between genes, the score BNRC 
is also good when the parent-child relationship is reversed. Therefore, the 
directions of the causal associations in the graph are not strict especially when 
the dependencies are almost linear. Our result basically advocates the result 
of Friedman et al (2000) 13 , but, of cause, there are different points in parts. 
There are some genes that mediate Friedman, et al J s result, such as MCD1, 
CSI2, YOX1 and so on. Most of these genes had been reported to play an 
important role. A large number of the relationships between genes are nearly 
linear. But we could find some nonlinear dependencies which linear models 
■ hardly find/ Figure 5 shows the estimated graph associated with genes. which 
were classfied their processes into cell cycle and their neighborhood. Here, 
we omit some branches in Figure 5, but important information are almost 
shown. As for the networks given by us and Friedman et aL 13 , we confirmed 
parent-children relationships and observed that both two networks are similar 
to each other. Especially, our network includes typical relationships which 
were reported by Friedman et al 13 . As for the differences between both 
networks, we attention the parents of SVS1. Friedman et. al 13 employed 
CLN2 and CDC5 as the parent genes of SVS1. On one hand, our result gives 
CSI2 and YKR090W for SVS1. We check up on the difference of these two 
results. -After all, in the sense of BNRC and AIC, our candidate parent genes 
are more appropriate than Friedman et al 13 's. The reason might be the 
effect of discretizing, because our model suitably fits to both cases in Figure 4. 
Especially we conclude that CDC5 gives just weak effects to SVS1 compared 
with other genes only in the case in which Spellman 20 's data (see also Figure 
3). In fact, as the parent gene of SVS1, the order of BNRC score of CDC5 
is 247th. Considering the circumstances mentioned above, our method can 
provide us valuable information in understandable and useful form. 
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Figure 4: Cell cycle data arid smoothed estimates, 
(a) and (b) Friedman c£ al. (2O0O) 13 , BNRG = 57.71, A1C=167.96; 
(c) and (d) Proposed method, BNRC = 32.53, Al'C=140.16. 



6 Discussion 

We proposed the new method for estimating genetic networks from microarray 
gene expression data by using Bayesian network and nonparametric regression. 
We derived anew criterion for choosing graph theoretically, and represented its 
effectiveness though the analysis of the cell cycle data. The advantages of our 
method are mainly as follows: We can use the expression data as continuous 
values. Not" only linear dependencies, we can also detect nonlinear structures 
and can visualize their functional structures being easily understandable. Fully 
automatic search can accomplish the creation of optimal graph. 

We also pointed out that Friedman et al 13 *s method remained the un- 
known parameters such as threshold value for discretizhig and hyper param- 
eters, in the Dirichlet priors which selected by trial and error and were not 
optimized. in a narrow sense. On the other hand, our proposed method can 
automatically and appropriately estimate any parameters based on proposed 
criterion which has a sounder theoretical basis. Besides, our method includes 
Friedman et al 13> s as a special case. 

We consider the following problems for our future works: (1) We used 
the statistical models based on Gaussian distribution However, we derive the 
criterion BNRC in more general situations. In fact, we can construct the graph 
selection criterion based on other statistical models. (2) It is a possible case 
that the outliers cause strange results. Thus, the development of the robust 
methods and the technique for detecting the outliers are important problems. 
(3) The intensities of the unions are probably measured by using bootstrap 
method 9 , We would like to investigate these problems in a future paper. 
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Figure 5: Cellcycle data rteult. 
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Example 2 

Nonlinear Modeling of Genetic Network by Bayesian 
Network and Nonparametric Regression with 
Heterogeneous Error Variances and Interactions 



ABSTRACT 

"We propose a new statistical method for construct- 
ing genetic networks from rnicroarray gene expres- 
sion data based on Bayesian networks- In the con- 
text of Bayesian network, an essential point of net- 
work construction is in the estimation of the con- 
ditional distribution of each random variable. We 
consider fitting the nonparametric regression models 
with* heterogeneous error variances and interactions 
for capturing the nonlinear structures between genes. 
Once we set a graph by using Bayesian network add 
nonparametric regression, a problem still remains to 
be solved in selecting an optimal graph, which gives 
a best representation of the system among genes. 
We theoretically derive a new criterion for choos- 
ing graph from Bayes approach in general situations. 
The proposed method contains the previous methods 
for estimating genetic networks by using Bayesian 
networks. Wc demonstrate the effectiveness of pro- 
posed method through the analysis of Saccharornycc$ 
cerevisiae gene expression data newly obtained by 
disrupting 100 genes. 

Key words: microarray gene expression data; 
Bayesian network; nonparametric regression; het- 
croscedasticity; interaction; posterior probability 



1. INTRODUCTION 

Under the development of the microarray technol- 
ogy, we can observe the gene expression data of 
thousands* now simultaneously. In the analysis 
of gene expression data, constructing genetic net- 
work receives a large amount of attention, in the 
field of molecular biology and bioinformatics (see 
PLW»[*]»[l»Ml8],[l«] and [22]). However, the causes 
of dimensionality and complexity of the data dis- 
turb the progress of the microarray gene expression 
' data analysis. That is to say, the mmrinatioii tnat" 
we want is buried with a huge amount of the data 
with noise. In this paper, we propose a new statis- 
tical method for constructing genetic network that 
can capture even the nonlinear relationships between 
genes clearer- 

Bayesian network ([19]) is an effective method in 
modeling the phenomena through the joint distri- 
bution of a large number of random variables. In 
recent years, some interesting works have been es- 
tablished in constructing genetic networks from the 
microarray gene expression data by using Bayesian 
networks. Eriedman and Goldszmidt [12] disexctized 
the expression values and assumed the multinomial 
distributions as candidate statistical models. Pc'cr 
et aL [22] investigated the threshold value for dis- 
claiming. On the other hand, Rriedxnan et al [13] 
pointed out that the discretizing probably looses the 
information of the data and considered to fit the lin- 
ear regression models, which analyze the data in the 
continuous. However, the assumption that the par- 
ent genes depend linearly on the objective gene is 
not always guaranteed* Imoto ot aL [18] proposed 
the use of nonparametric additive regression models 
(see also [16]) for capturing not only linear dependen- 
cies but also nonlinear structures between "genes. In 
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this pap cr" wc" propose f EelnethbdTor constructing 
the genetic network by using Bayesian networks and 
tike nonpaxametric hcteroscedastie regression model, 
which is more resistant to the effect of outliers and 
can also capture the effects of the interactions of* par- 
ent genes. 

Once we set the graph, we have to evaluate its 
goodness or closeness to the true graph 7 which is 
completely unk nown. Hence, the construction of a 
suitable criterion becomes the center of attention 
of statistical genetic network modeling. EWedman 
and Goldsamidt [12] derived the criterion, BDe, for 
choosing graph based on the multinomial models and 
Dirichlct priors. However* they remained the un- 
known hyper parameters in Dirichlet priors and we 
only set up the value? expericntially. We investi- 
gate the graph selection problem as the statistical 
model selection or evaluation problem and theoreti- 
cally derive a new criterion for choosing graph from 
Bayes approach ([8]). The proposed criterion auto- 
matically optimizes the all parameters in the model 
and gives the optimal graph. In addition, our pro- 
posed method includes the previous methods for con- 
structing genetic network by Bayesian network- To 
show the effectiveness of the proposed method, we 
analyze gene expression d&ta of Saccharomycv? cere- 
vis iae by disrupting 100 gencs. 

2. BAYESIAN NETWORK AND NON- 
PARAMETRIC HETEROS CED ASTIC 
REGRESSION MODEL WITH INTER- 
ACTIONS 

2.1 Nonlinear model in Bayesian network 

Suppose that we have n sets of data . - ,a? n } 

of the jD-dimensional random variable vector 
J£ — (Xi 7 -„, Xp) T * where = (xn, &\p) T and 
cc T denotes the transpose of as. In rnicroarrny gene 
expression data, n and p correspond to the n umb ers 
of arrays and genes.- Under the Bayesian network 
framework, we consider a directed acyclic graph 
G and Markov assumption between nodes. The 
joint density function is then decomposed into, the 
conditional density (see also [10]) of each variable, 
that is, 

f&il , » - , Xi P ) = h {*n \pn) X - • x f P (*ip}Pip)-, (1) 

where p^ = (?a j-m^) 7 are ^-dimensional par- 
ent observation vectors of x^j in the graph G. When 
JP*i — (^T 2 , Xz) T is the parent variable vector of Xt 7 
we see p ix = (^i2>arfs) r 7 (i = l,»-,n). Through the 
formula (l), the focus of interest in statistical mod- 



eling bf :;%Ye$iarf y^®4l^!fe'^^.iMJ« instruct 
the conditional densities, fj- We assume that the 
conditional densities, fj, arc parameterized by the 
parameter vectors 0j 7 and the effective information 
is extracted from these probabilistic models, 

Imoto et ah [18] proposed the use of nonparamet- 
ric regression strategy for capturing the nonlinear 
relationships between at^ and In many cases, 
this method can capture the objective relationships 
very well When the data, however, contain outliers 
especially near the boundary on the domain {pij}? 
nonpar ainetric regression models sometimes lead to 
unsuitable smoothed estimates, i.e. 7 the estimated 
curve exhibits some spurious wavincss due to the 
effects of the outliers- Since what is estimated is 
the system of a being, the too much complicated 
relationship is unsuitable. In fact, this inappropri- 
ate case sometimes occurs in the real data analy- 
sis. To avoid this problem, we consider fitting the 
nonparametric regression model with heterogeneous 
error variances 

Xij = mjipij) 4- e ijy dj ~ N (0, <r?-), (2) 

where 7ny(-) is a smooth function from R q > to R and 
JV(//,tr 2 ) denotes Gaussian, distribution with mean 
\i and variance a 2 - Here R denotes a set of xeal 
numbers. This model includes Imoto et al. [lSfs 
model and, clearly, the linear regression model as 
special cases. 

One possible approach to the construction of the 
systematic component, mj{pij)j is the nonparamet- 
ric additive regreSSOT [18] 

mjtPij) = m/i <P%) + • ■ ■ + m^Cpg), (3) 

which is a straightforward extension of the mul- 
tiple linear regression. In general, each smooth 
function is characterized by the rt values 

m Jfc(Pi& )i'-'i m jk(p^l) s^d the system (3) contains 
n x qj parameters- Then the number of the parame- 
ters in the model is much larger than the number of 
observations and it has a tendency toward unstable 
parameter estimates. In this paper, we construct the 
smooth function vrtjki*) by basis functions approach 

where ...,7^ 4 h are unknown coefficient pa- 

rameters and &ia ^jM^fcO 0X0 basis func- 
tions* From this represent ation ? the n parameters 
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Mjk coefficient parameters 7^?,— ^*M jh k~ ^ n *kc 
other hand, our interest is in the effects of interac- 
tions, we can construct the interaction model 

^•(p^^E^rf). ( 4 ) 

where p*jjp is the subvector of p £7 -, that is, if we inter- 
est in the interaction of and ^ve then obtain 
<pW = (p§\p^) T . We can also construct the func- 
tion (*) by 

where c{j^(pj^) axe basis functions and axe pa- 
rameters. Combining (3) and (4) a the nonparametric 
regrcssor with interactions is generally given by 

In the error variances, ofj, we assume the following 
structures; 

°% = * = if-v"* i - i.-iPi ( 6 ) 

where tv\j 7 ... :t v7 Tl j are constants and <Tj is an mi- 
known parameter. By setting up the constants 
wijj „. y tu n j in reflecting the feature of the error vari- 
ances, we can represent the heteroscedasticity of the 
data. Combining (2), (&) and (6), we obtain a non- 
parametric regression model with, heterogeneous er- 
ror variances and interactions 

-T^liptP)}*]* CO 
r=l J 

where -7^ and axe M^-dimensional vectors 

given by, respectively, 

~ _ fJS) (f) NT 
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given "by, respectively, 

Thus the parameters in the model (7) are 7^ = 

(T&.-.T&rVfc = (€i,.-.,^) r and 4 If the 
variable Xy has no parent variables in the graph, we 
specify the model based on the normal distribution 
with mean fXj and variance <rj. Hence the Bayesian 
network model is given by 

where #g = (^iV- >9p) T tne parameter vector 
included in the graph G and 0j is the parameter 
vector in the conditional density fj 7 that is, we see 

oj = °* »j - (^,of) r . 

2.2 Criterion for choosing graph 

Once wc set a graph, the statistical model (8) based 
on the Bayesian network and nonparametric regres- 
sion can be constructed and be estimated by a suit- 
able method... However, ;fche problem that still re- 
mains to be solved is how wc can choose the optimal 
graph, which gives a best approximation of the sys- 
tem underlying the data. Notice that we can not 
use the likelihood function as a model selection cri- 
terion, because the value of likelihood becomes large 
in a more complicated model Hence, it needs to 
consider the statistical approach based on the gener- 
alized or predictive error, Kullback-Lcibler informa- 
tion, Baycs approach and so on ([20]), In this sec- 
tion, wc construct a criterion for evaluating graph 
based on our model (8) from Bayes approach. 

When we set a graph, the criterion for evaluating 
the goodness of the graph can be constructed from 
Bayesian theoretic approach as follow; The posterior 
probability of the graph is obtained by the product 
of the prior probability of the graph, 7r a , and the 
marginal probability of the data. By removing the 
Standardized constant, the posterior probability of 
the graph is proportional to 

*(G\X n ) = 7r a / f[ f(x Vl 0a)^{eo\X)de C7 (9) 

J z=l 

where X n = (a?i,-.,a? n } T is an n x n gene profile 
matrix, 7r(#c?|A) is the prior distribution on the pa- 
rameter Bo .satisfying log7r(#G|A) = 0(n) and A is 



79 



WO 03/027262 



PCT/US02/31093 



the lrj^er^par^m^ Bayes approach, /where 

we can choose the optimal graph, such that ir{G\X v ) 
is maximum. A crucial problem for constructing a l^{Bj\X n ) 
criterion based on the posterior probability of the 
graph is the computation of the high dimensional 
integral (9). Friedman and Goldszrnidt [12] used the 
conjugate priors for solving the integral and gave a 
closed-form solution. To compute this high dimen- 
sional integral, we use Laplace's apprcodmation ([ll] 7 
[17] and [26)) for integrals 



~ log fj{Xij\PijlOj) + ^ logTT^lX,). 



Here A* is the hyper parameter vector. Hence by 



defining 
BNRCj 
- -2 log 



J -;=i 

where r is the dimension of 9a, 
h(9 G \X n ) = -JZlogf(xi;ea) + ilogTr^olA), 
M0 G ) = -&{U9 a \X n )} /MGd&G 



where irz,. are prior probabilities satisfying 
J2^=\ log 7r L j = log 7v G: the BNRC score is given by 
the sum of the local BNRC scores 



BNRC = J^BNRCy. 



(H) 



7=1 



and 9 G is the mode of b(0<?|-X»)- Then we hare a 
criterion, BNRC, for selecting graph 

BNRC(G) 

- —2 log 

« -2 log 7v G ~r log(27r/rt) 4- log \J^{0 O )\ 
-2nl x {e a \X n ). (10) 

The optimal graph is chosen ouch that the criterion 
BNRC (10) is minimaL The merit of the use of the 
Laplace mctnod is that it is not necessary to consider 
the use of the conjugate prior distribution. Hence 
the modeling in the larger classes of distributions of 
the model and prior is attained. 

Suppose that the parameter vectors Oj are inde- 
pendent one another, the prior distribution then can 
be decomposed into 

Therefore, log\J x {9 0 \X n )\ and nlx{0 G \X n ) in (10) 
arc, respectively, 



The smoothed estimates based on nonpaTametric re- 
gression are obtained by replacing the parameters -yj 
and £j by ^ and £p respectively. Noticed that we 
derive the criterion, BNRC, under the assumption 
log 7t{6 g \X) — 0{n). If we use the prior density sat- 
isfying log7r(#c!^) = O(l), the BNRC criterion is 
resulted in Schware [25] 's criterion known as BIG 01 

SIC. In such case, the mode Q G is equivalent to the 
m&Dtmmm likelihood estimate. 

3. ESTIMATING GENETIC NETWORK 

3.1 Nonpar ametric regression 

In this section we practically present the mcthoc 
for constructing genetic network based on proposec 
method described in Section 2. First we would lik< 
to mention the nonpaTametric regression modcL Th< 
nonpaxamctric repressor (5) has the two components 
the main effects represented by the additive mode 
of each parent genue and the interactions. In tin 
additive model, we construct each smooth function 
nyjbO b y ^-splines ([9] and [18])- 

In the interaction terms, we use Gaussian radia 
basis function 



: exp 



io E \j^(e a \x n )\ - ^iog 



where Xji is a center vector, s? t is a width param 
eter and Cji a nv P er parameter. In the contex 
of the regression modeling based on the radial basi 
functions, there are two methods for estimating th 
centers zji and the widths First, Zji and ar 
estimated by m in imizing or maximizing a suitabl 
object function like the squared loss and likelihood 
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This mcthod^ U c^ed fuUy^snpervised learning. On 
one hand, an alternative method determines Xji and 
Sji by usmg only the parent observation data 
in advance. In this paper, we employ, the latter 
method and use the fc-mcans clustering algorithm 
for constructing the basis functions. The details of 
the radial basis functions ate shown in [7], [21] and 
[23], The hyper parameters control the amount 
of overlapping among, basis functions. 

In the error variances, we consider the het- 
eroscedastic regression model and assume the struc- 
ture (6). The design of the constants wij 7 .., 7 w n j 
is an important problem for capturing the het- 
erosccdasticity of the data, Wc set the weights as 

where p is a hyper parameter, pj ~ YliLiPij/ n anc * 



weights are -u^ 



1 and the model has 



homogeneous error variances. If we use the large 
value of p y the error variances of the data, which 
exist near the boundary on the domain of the parent 
variables, are large. Hence, if there are the outliers 
near the boundary, we can reduce their effects and 
gain the suitable smoothed estimates by using the 
appropriate value of p. 

3.2 Priors 

Suppose that the prior distribution 7Tj{0j\\j) Is fac- 
torized into 

where \jh and i/ji arc hyper parameters. We use a 
singular Mjk variate normal distribution as the prior 
distribution on *>jfc> 

x exp (-^^rJ^iJfe-yy*) , (13) 

where Kjh & an Mj* * .My* symmetric posi- 
tive scmidefiiiit matrix satisfying ^ifk^j^ljk = 

T£M7$ -2y^ k +^l^ 2 - The prior distri- 
bntions on £jz axe 

where Wjj is the dimension of f^- 



Next pyfcoSyeiiffi f£ol&1W tlte graph 
ttq- Eriedman and Goldszinit [12] employed the prior 
based on the MDL encoding of the graph. In our con- 
text, the marginal probability of the data is seemed 
to the type IT likelihood adjusted by the hyper pa- 
rameters. Thus we set the prior probability of the 
graph, nch 

7T a ~ Gxp{— (No. of hyper parameters)} 
v p 
= H**p{-{V + 2sj+l)} « II *V 

The justification of this prior is based on Akaike [2] 7 s 
Bayesian information criterion, known as ABIC, and 
Akaikc [ij's information criterion, AIC. 

3-3 Criterion 

In section 2.2, wc derived the criterion, BNRC, for 
choosing the graph in general framework. By using 
the equation (11), the BNRC score of the graph can 
be obtained by the sum of the local scores, BNRC,. 
Tho.result is summarized in the following theorem. 

Theorem 1. Let f(sa\0o) = n§=i fj(*y\Pij'i *i) 
be a Bayesian network and nonparametric regres- 
sion model given by (8)> and let ^{jj^jh) an ^ 
be the prior densities on the parameters 
-y jk and ^ji defined by (13) and (14), respectively: 
Then a criterion for evaluating graph is given by 
BNRC = Y$ asl BNRCy, where 

BNRC, = 2(qj 4- 2s j 4- 1) 

-iJ2 M jk + jh L jX + 1) log(27r/n) 

~ 53 «W + 71 log(27r^) -f- n 
i=i 

+ E0<>s|Ay ft | - My*]og(7UT?)} -log(2^) 
*i 

+ £{log|I>| - Ljilo&(n&?)} 
■ J=i 

+ E((M ifc - 2) log (^J - log \K jh \+ 



with 
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% = (cg > (pg , ),.^cg e (p»)) r ; (nx£ j7 ), 
= <iiag(ti;y,...,ta„j); (nx») . 
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i=l 



-E^7(P^)} 2 M 
£=1 

Merc -we approximate the Hessian matrix by 



log 



d9jd9j 



<6 



i=l 



•log 



2\2 



3,4 Penalized likelihood 

Consider the nonparametric regression model de- 
fined in (7)_ When the estimation method 
is the maximum likelihood method which max- 
imizes the log-likelihood function l{y^ £j ? tff ) = 
73Lx^fA&ij\Pifinrj>£j*rf)i then the parameter 
estimates yield unstable and lead to ovcrfitthig due 
to the flexibility of the model- In our method, the 
estimate 0j = (^J, £j , is obtained as a mode of 
lx 3 {0j\K n ) and the estimation method is equivalent 
to the maximum penalized likelihood ((14) and [15]), 
which includes the maximum likelihood method as a 
Special case. 

The mode 0j depends on the hyper parameters. In 
fact, the hyper parameter plays an essential role for 
estimating the smoothed curve. We use the distribu- 
tions (13) and (14) as the prior distributions on -jj k 
and respectively. Then (0j\ Xn) is resulted in 
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Figure 1: The smoothed estimates by the various 
values of the hyper parameter. 



That is, if the smoothing parameter is small, the es- 
timated curve close to the data, what is called over- 
fitting. On the other hand 7 the large value of the 
smoothing parameter is used, the estimate is nearly 
linear fit. Hence, the choice of the hyper parameter 
plays an essential role for capturing the relationship 
between genes. The performances of the hyper pa- 
rameters are shown in the next section. 



t-1 



*=1 * 1=1 * 

The first term in the right hand side is the log- 
likelihood function and the second and third terms 
are called roughness penalties. The hyper param- 
eters Aj-fc and Vji are called smoothing parameters 
which control the smoothness of the fitted curves. 



4. REAL DATA ANALYSIS 

In this section we show the effectiveness of the pro-, 
posed method through the analysis of S~ cercvisiae 
gene expression data. Our research group has in- 
stalled a systematic experimental method, which ob- 
serves the changes of expression level of genes on 
a microarray by gene disruption. By using this 
method, we have launched a project whose purpose 
is to reveal the gene regulatory networks between the 
5871 genes of Sacckaromyces ccrcvisiae while many 
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(a) 



YOM167C ►YGR109C 



\ 



X <b) / 



*,0 



1.5 



(b) 

Figure 2: The effect of the weight constants. 



laboratories have also reported similar projects. We 
have already collected a large number of expression 
profiles from gene disruption experiments to evalu- 
. ate genetic regulatory networks. Oyer 400 mutants 
are stocked and gene expression profiles are accumu- 
latiag- 

We monitored the transcriptional revel of 6371 
genes spotted on a microaxray by a scanner. The ex- 
pression profiles of over 400 disruptants were piled 
up in our database. The standard deviation (SD) 
of all genes on a microarr&y were evaluated "and the 
value of SD represented roughly the error of a ex- 
periment. In our data, we estimated the value of 0,5 
is the critical point of the accuracy of experiments. 
We have evaluated the accuracy of those profiles on 
the base of the standard deviation of the expression 
ratio of all genes. 107 disruptants including 68 mu- 
tants where the transcription factors were disrupted 
could be selected from 400 profiles. 

We used 100 microarrays and constructed the ge- 




Figure 3; The estimated surface by the nonparame- 
teric regression with interaction. 



netic network of 521 genes from the above data. Be- 
cause, the 94 transcription factors whose regulating 
genes are clearly identified were found, and the pro- 
files of the 521 genes in control by 04 factors were se- 
lected from 5871 profiles. In our model, we construct 
the nonparametric regression model by 20 i?-splincs 
and 20 radial basis functions. We confirmed that 
the differences of the smoothed estimates against the 
various number of the basis functions can not visu- 
ally found, because when we use the somewhat large 
number of the basis functions, the hyper parame- 
ters control the smoothness of the fitted curves. We 
applied the two genes effect to tbc interaction com- 
ponents. Hence, the effects of the interactions are 
obtained as the fitted surfaces and can be visually 
understandable. 

Before we mention the result of this analysis, we 
show the roles of the hyper parameters in the prior 
distributions and weight constants. Figure 1 (a) 
shows the scatter plot of YGL237C and YEL071W 
with smoothed estimates by 3 different values of the 
hyper parameters. Clearly, the smoothed estimate 
strongly depends on the values of the hyper parame- 
ters. Figure 1 (b) is the behavior of the BNHC crite- 
rion of the two genes in Figure 1 (a). We can choose 
the optimal value of the hyper parameter as the min- 
imizcrs of the BNRC and the optimal smoothed esti- 
mate (solid curve) can capture the structure between 
these genes wclL The dashed and dotted curves 
arc near the maximum likelihood estimate and the 
parametric linear fit, respectively. The effect of the. 
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Figure 4: The resulted partial network of tlie analysis of 521 Saccharorntjces cerevisiae genes. 
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weig^conVtaiits «£Jls shown in Figure 2 (a). 

If we use the homoscedastic regression model, we 
obtain the dashed curve, which exhibits some spu? 
rious waviness due to the effect of the data in the 
upper-left corner. By adjusting the hyper parame- 
ter p in (12), the estimated curve is resulted in the 
solid curve. The optimal value of p is also chosen by 

•mini -mi -zin g 

the BNRC criterion (see Figure 2 (b)). 
Of course, when the smoothed estimate is properly 
obtained, the optimal -value of p tends to zero. 

We employ two step strategy for fitting the non- 
parametric regression model with interactions. First, 
we estimate the main effects represented by the ad- 
ditive B spline repressors. Next, we fit the Inter- 
action components to the residuals. Figure 3 is an 
example of the fitted surface to the relationship be- 
tween YIL094C and its parent genes, YKL152C and 
YER055C. It is clearly, shown that the interaction of 
the two parent genes leads more overexpress when 
both parent genes increase. 

The results of the analysis and their evaluations 
arc as follows: In Saccharomyces cerevisiae the 
GCN4 gene encodes the transcriptional activator of 
the "general control" system of amino acid bioyn- 
thesis, a network of at least 12 different biosyn- 
thetic pathways [6]- Experiments showed that the 
consequences of the general control -response- upon 
the signal "amino acid starvation 3 ' induced by the 
histi-dinc analogue 3-arninotriazole with respect to 
Gcn4p levels. GCN4 activates transcription of more 
than 30 genes involved in the biosynthesis of 11 
amino acids in response to amino acid starvation 
or impaired activity of tRNA synthetases (sec [24]). 
Purine biosynthetic genes ADEl, ADE4, ADE5,7, 
and ADE8 display GCN4-dependent expression in 
response to amino acid starvation [24]. GCN4 ac- 
tivates transcription of biosynthetic genes for amino 
acids and purines in response to either amino acid ox 
purine starvation [24]- Those results of experiments 
shows that there are the strong relations between 
purine metabolism and amino acid metabolism by a 
GCN4. Our map of relation realized well the many 
relations among purine and aminoacid metabolism. 

5. CONCLUSION 

In this paper we proposed a new statistical method 
for estimating genetic network from rniexoarray gene 
expression data by using Bayesian network and non- 
parametric regression. The key idea of our method i$ 
the use of the nonparametric heteroscedastic regres- 
sion models for capturing nonlinear relationships be- 
tween genes and heteroscedasticity of the expression 



data. fS^^&tW ilnstnic- 
tion is in the evaluation of the graph. We investi- 
gated this problem as a statistical model selection or 
evaluation problem and derived the new criterion for 
selecting graph from Bayes approach. Our method 
covers the previous methods for constructing genetic 
networks by using Bayesian networks and improves 
them in the theoretical and methodological sense. 
We showed the effectiveness of our method through 
the analysis of Saccharomyces cerevisiae gene ex- 
pression data and evaluated the resulted network 
by comparing with the knowledge of biology. We 
construct the genetic network without the biological 
information. Nevertheless, the resulted network in- 
cludes the many important connections, which agree 
with the biological knowledge. Hence, wc expect 
that our method can demonstrate its power against 
the analysis of completely unknown system like the 
human genome. 
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Bayesian Network and Nonparametric Heteroscedastic Regression for Nonlinear 

Modeling of Genetic Network 



Abstract 

We propose a new statistical method for constructing a 
genetic network from microarray gene expression data by 
using a Bayesian network. An essential point of Bayesian 
network construction is in the estimation of the conditional 
distribution of each random variable. We consider fitting 
nonparametric regression models with heterogeneous error 
variances to the microarray gene expression data to cap- 
ture the nonlinear structures between genes. A problem still 
remains to be solved in selecting an optimal graph, which 
gives the best representation of the system among genes. 
We theoretically derive a new graph selection criterion from 
Bayes approach in general situations. The proposed method 
includes previous methods based on Bayesian networks. 
We demonstrate the effectiveness of the proposed method 
through the analysis of Saccharomyces cerevisiae gene ex- 
pression data newly obtained by disrupting 100 genes. 



1. Introduction 

Due to the development of the microarray technology, 
constructing genetic network receives a large amount of at- 
tention in the fields of molecular biology and bioinformatics 
[3, 4, 5, 14, 15, 17, 22, 28]. However, the dimensionality 
and complexity of the data disturb the progress of the mi- 
croarray gene expression data analysis. That is to say, the 
information that we want is buried in a huge amount of the 
data with noise. In this paper, we propose a new statistical 
method for constructing a genetic network that can capture 



even the nonlinear relationships between genes clearer. 

A Bayesian network [7, 23] is an effective method in 
modeling phenomena through the joint distribution of a 
large number of random variables. In recent years, some 
interesting works have been established in constructing ge- 
netic networks from microarray gene expression data by us- 
ing Bayesian networks. Friedman and Goldszmidt [12, 13, 
14] discretized the expression values and assumed multino- 
mial distributions as the candidate statistical models. Pe*er 
et al. [28] investigated the threshold value for discretizing. 
On the other hand, Friedman et al. [1 5] pointed out that the 
discretizing probably loses information of the data. In fact, 
the number of discretizing values and the thresholds are un- 
known parameters, which have to be estimated from the 
data. The resulted network strongly depends on their values. 
Then Friedman et al. [15] considered fitting linear regres- 
sion models, which analyze the data in the continuous (see 
also [20]). However, the assumption that the parent genes 
depend linearly on the objective gene is not always guaran- 
teed. ImotoeraJ. [22] proposed the use of nonparametric 
additive regression models (see also [16, 18]) for capturing 
not only linear dependencies but also nonlinear structures 
between genes. In this paper, we propose a method for con- 
structing the genetic network by using Bayesian networks 
and the nonparametric heteroscedastic regression, which is 
more resistant to the effect of outliers. 

Once we set the graph, we have to evaluate its good- 
ness or closeness to the true graph, which is completely un- 
known. Hence, the construction of a suitable criterion be- 
comes the center of attention of statistical genetic network 
modeling. Friedman and Goldszmidt [14] used the BDe 
criterion, which was originally derived by [21] for choos- 
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ing a graph. The BDe criterion only evaluates the Bayesian 
network model based on the multinomial distributions and 
Dirichlet priors. However Friedman and Goldszmidt [14] 
kept the unknown hyper parameters in Dirichlet priors and 
we only set up the values experimentally. We investigate 
the graph selection problem as a statistical model selection 
or evaluation problem and theoretically derive a new cri- 
terion for choosing a graph using the Bayes approach (see 
[6]). The proposed criterion automatically optimizes all pa- 
rameters in the model and gives the optimal graph. In addi- 
tion, our proposed method includes the previous methods 
for constructing genetic network based on Bayesian net- 
work. To show the effectiveness of the proposed method, 
we use the Monte Carlo simulation method. We also an- 
alyze gene expression data of Saccharomyces cerevisiae 
newly obtained by disrupting 100 genes. 

2. Bayesian Network and Nonparametric Het- 
eroscedastic Regression Model 

2.1. Nonlinear Bayesian network model 

Suppose that we have n sets of array data {xj , x n } 
ofp genes, where x; = (x u , x ip ) T and x T denotes the 
transpose of x. In the Bayesian network framework, we 
consider a directed acyclic graph G and Markov assumption 
between nodes. The joint density function is then decom- 
posed into the conditional density of each variable, that is, 

p 

i=i 

where p {j = — ,p%]) T are g^-dimensional parent ob- 
servation vectors of x^ in the graph G. When gene 2 and 
gene 3 are parent genes of genei, we see p n = (x i2 , Xiz) T y 
(i = l,...,n). Through formula (1), the focus of interest 
in statistical modeling by Bayesian networks is how we can 
construct the conditional densities, fj~ We assume that the 
conditional densities, f jy are parameterized by the parame- 
ter vectors Oj, and the information is extracted from these 
probabilistic models. 

Imoto et al [22] proposed the use of nonparametric re- 
gression strategy for capturing the nonlinear relationships 
between x tJ - and p ti and suggested that there are many 
nonlinear relationships between genes and the linear model 
hardly achieves a sufficient result. In many cases, this 
method can capture the objective relationships very well. 
When the data, however, contain outliers especially near the 
boundary of the domain {Pij}, nonparametric regression 
models sometimes lead to unsuitable smoothed estimates, 
i.e., the estimated curve exhibits some spurious waviness 
due to the effects of the outliers. Since what is estimated 



is the system of a living nature, a too complicated relation- 
ship is unsuitable. In fact, this inappropriate case unfor- 
tunately sometimes occurs in the analysis of real data. To 
avoid this problem, we consider fitting a nonparametric re- 
gression model with heterogeneous error variances 

xij = (pi?) + • • • + m jg . {p\ J g ]) + e ijy (2) 

where Eij depends independently and normally on mean 
0 and variance cr?- and mjfc(-) is a smooth function from 
R to R. Here R denotes a set of real numbers. This 
model includes Imoto et al [22]'s model and, clearly, 
the linear regression model as special cases. In general, 
each smooth function m jk {-) is characterized by the n val- 
ues m jk (p$),~-,rn jk (p { r!b and the system (2) contains 
(n x Qj + n) parameters. Then the number of the parameters 
in the model is much larger than the number of observations 
and it has a tendency toward unstable parameter estimates. 
In this paper, we construct the smooth function mjfc(-) by 
the basis functions approach 

TTX—1 

where 7^»—,7^. fc jfc ar ^ unknown coefficient parameters 
and 6ift(")i—> & /^ fc fc(0 are basis factions. From tnis 
representation, the n parameters rn jk (j}[ J ^) 7 m jk (j>„l) 
are reparameterized by the Mj k coefficient parameters 

Tut >"*>7w. fc jfc- 

We strongly recommend the use of nonparametric re- 
gression instead of linear regression, because linear regres- 
sion cannot decide the direction of the Bayes causality or 
leads to the wrong direction in many cases. We show the 
advantage of the proposed model compared with linear re- 
gression through a simple example. Suppose that we have 
data of genei and gene 2 in Figure 1 (a). We consider the 
two models genei -> gene 2 and gene 2 -» genei, and obtain 
the smoothed estimates shown in Figure 1 (b) and (c), re- 
spectively. We decide that the model (b: genei -> gene 2 ) 
is better that (c: gene 2 — ► genei) by the proposed criterion, 
which is derived in a later section (the scores of the models 
are (b) 120.6 (c) 134.8). Since we generated this data from 
the true graph genei -> gene 2 , our method yields the cor- 
rect result However, if we fit the linear regression model to 
this data, the model (c) is chosen (the scores are (b). 156.0 
(c) 1 35.8). The method, which is based on linear regression, 
yields an incorrect result in this case. 

Consider the case that the relationship is almost linear. 
Our method and linear regression can fit the data appropri- 
ately. However, it is clearly difficult to decide the direction 
of Bayes causality. In such a case, the direction is not strict 
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genel genel gene2 

(a) (b) (c) 

Figure 1. Simulated data: The true causality is genej -> gene 2 . (a) Scatter plot of the simulated data, 
(b) Smoothed curve of the graph genei -> gene 2 . (c) Smoothed curve of the graph gene 2 ->• genei. 
These curves are obtained by the proposed method. 



In the error variances, we assume the structures, 

where W\j y ..-,itf n j are constants and cr| is an unknown pa- 
rameter. By setting up the constants wij t ...,iv n j in re- 
flecting the feature of the error variances, we can represent 
the heteroscedasticity of the data. Combining (2) and (3), 
we obtain a nonparametric regression model with heteroge- 
neous error variances 

k=i J 

where yj k and Ojfe(p^) are Mjk -dimensional vectors 
given by, respectively, -y jk = (7$, -Wm]^ and 

&i*(p£) = (&ii ) (>li ) ),...,^ fcJfe (pH ) )) r . If thei-th gene 
has no parent genes in the graph, we specify the model 
based on the normal distribution with mean ft and vari- 
ance Oj- Hence, we define the nonlinear Bayesian network 
model 

/(«« o G ) = n f&i>\Pii', (5) 

where = #J) T is the parameter vector included 

in the graph G and &j is the parameter vector in the con- 
ditional density fj, that is, we see Oj = (7j\tf|) T or 



2.2. Criterion for choosing graph 

Once we set a graph, the statistical model (5) based on 
the Bayesian network and nonparametric regression can be 
constructed and be estimated by a suitable procedure. How- 
ever, the problem that still remains to be solved is how we 
can choose the optimal graph, which gives a best approx- 
imation of the system underlying the data. Notice that we 
cannot use the likelihood function as a model selection cri- 
terion, because the value of likelihood becomes large in a 
more complicated model. Hence, we need to consider the 
statistical approach based on the generalized or predictive 
error, Kullback-Leibler information, Bayes approach and so 
on (see e.g., [1, 24, 25] for the statistical model selection 
problem). In this section, we construct a criterion for evalu- 
ating a graph based on our model (5) from Bayes approach. 

The posterior probability of the graph is obtained by the 
product of the prior probability of the graph, tt g , and the 
marginal probability of the data. By removing the standard- 
izing constant, the posterior probability of the graph is pro- 
portional to 

ir(G\X n ) =7v G jf[ f(x i; e G )*(o G \\)de G , (6) 

where X n = (^1 , x n ) T is an n x p gene profile matrix, 
7r(#<?|A) is the prior distribution on the parameter 0 G sat- 
isfying \o§-k{9 g \X) — 0(n) and A is the hyper parameter 
vector. Under Bayes approach, we can choose the optimal 
graph such that ir(G\X n ) is maximum. A crucial problem 
for constructing a criterion based on the posterior probabil- 
ity of the graph is the computation of the high dimensional 
integration (6). Heckerman and Geiger [20] used the conju- 
gate priors for solving the integral and gave a closed-form 
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solution. To compute this high dimensional integration, we 
use Laplace*s approximation [9, 19, 31] for integrals 



/5 



f(x i ' J 9 G M9 G \X)d9 G 



= TTT^^ exp{nZ A (& G ]X n )}{l + O^n" 1 )}, 

\J>i\&G)l x/ 

where r is the dimension of 9 Gy h{&G\X n ) = 
E^Jog/(x i; ^)/n 4- log7r(^t A)/n, J X (9 G ) = 
-d 2 {t x {9 G \X n )}/d9 G d9 G ' and 0 G is the mode of 
lx(0 G \X n ). Then we define the Bayesian network and 
nonparametric heteroscedastic regression criterion, named 
BNRChetero » for selecting a graph 

= -2 log 

« -21og7r G - rlog(2ir/n) + log|A(0 G )| 



-2 rt i A (d G |X n ). 



(7) 



The optimal graph is chosen such that the criterion 
BNRChetero (7) is minimal. The merit of the use of the 
Laplace method is that it is not necessary to consider the 
use of the conjugate prior distribution. Hence the modeling 
in the larger classes of distributions of the model and prior 
is attained. 

Suppose that the parameter vectors 9j are independent 
one another, the prior distribution can be decomposed into 
7t(9 g \X) = 11?=! Therefore, log | J x (9 G \X n )\ 

and nl\{0 G \X n ) in (7) result in, respectively, 



log\Jx(0 G \X n )\ = £>6 

h(0 G \X n ) = f^h^XJ, 



dOjdej 



where l Xj (#il-^n) = 
logTTj^jlAjO/n. Here Aj 
vector. Hence by defining 



is the hyper parameter 



BNRC& 



hetero 



= -2io 6 jy*^^ , 

where -kl 5 are prior probabilities satisfying 
lo g*X; =' logTTG, ^ BNRC^ tcro score is 
given by the sum of the local scores 



ero — 



(8) 




Figure 2. The fitted curve to simulated 
data: The thiri curves are B-splines that are 
weighted by coefficients and the thick curve 
is the smoothed estimate that is otained by 
the linear combination of weighted B-splines. 



The smoothed estimates based on nonparametric het- 
eroscedastic regression are obtained by replacing the pa- 
rameters 'Yj- by 'jj. Noticed that we derive the criterion, 
BNRCh e *ero, under the assumption \og7r(0 G \\) = 0(n). 
If we use the prior density satisfying }ogir{0 G \X) = 0(1), 
the BNRC/, c£er o score results in Schwarz's criterion known 
as B1C or SIC [30]. In such case, the mode 9 G is equivalent 
to the maximum likelihood estimate. 

3. Estimating Genetic Network 
3.1. Nonparametric regression 

In this section we present the method for constructing 
genetic network in practice based on the proposed method 
described above. First we would like to mention the non- 
parametric regression model. In the additive model, we con- 
struct each smooth function rrijk(') by i?-splines [10, 22]. 
Figure 2 is an example of 5-splines smoothed curve. The 
thin curves are ^-splines that are weighted by coefficients 
and thick line is a smoothed curve that is obtained by the 
linear combination of weighted ^-splines. 

In the error variances, we consider the heteroscedastic 
regression model and assume the structure (3). Choosing 
constants wij, w nj is an important problem for captur- 
ing the heteroscedasticity of the data. In this paper, we set 
the weights 

Wij = giPif, Pj) = exp{- Pj \\ Pij - P 5 \?f2s)^ (9) 

where p 5 is a hyper parameter, pj = YJi=\Pijf n and 
s) = 2" w \\Pi 3 -Pi\?Imj- If we set Pj = 0, the weights 
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are toy = — = w n j = 1 and the model has homoge- 
neous error variances. If we use a large value of pj, the 
error variances of the data, which exist near the boundary 
on the domain of the parent variables, are large. Hence, if 
there are outliers near the boundary, we can reduce their ef- 
fect and gain the suitable smoothed estimates by using the 
appropriate value of 

3.2. Priors 

Suppose that the prior distribution ?Tj(Oj\\j) is factor- 
ized as ^(0,-1 A*) = n*=i Kjkhjkfak)* where \ jk are 
hyper parameters. We use a singular M jk variate normal 
distribution as the prior distribution on f jk> 

< ,11 ( 2 * \ {Mi "- 2V ] K ,1/2 

xexp(-^TTA7 3 *), (10) 

where K jk is an M jk x M jk symmetric positive semidef- 
inite matrix satisfying ^yJ k K jk qf jk = X)^s(7a2 " 

Z la-l,k ^ /a~2,fc/ * 

Next we consider the prior probability of the graph ttg- 
Friedman and Goldszmit [14] employed the prior based 
on the MDL encoding of the graph. In our context, the 
marginal probability of the data is equivalent to the type 
II likelihood adjusted by the hyper parameters. Thus we set 
the prior probability of the graph, ttc, 

7T(7 = exp{ — (No. of hyper parameters) } 
p p 

= Uexp{-( qj +i)} = n* £ j- 

The justification of this prior is based on Akaike's Bayesian 
information criterion, known as ABIC [2], and Akaike's in- 
formation criterion, AIC [1]. 

33. Criterion 

We derived the criterion, BNRChetcro, for choosing the 
graph in a general framework. By using the equation (8), the 
BNRCfcctcro score of the graph can be obtained by the sum 
of the local scores, BNRC^ tcro . The result is summarized 
in the following theorem. 

Theorem 1. Let /(ccr, 0 G ) be a Bayesian network and non- 
parametric heteroscedastic regression model given by (5), 
and let ^(Tjftl^jfc) De * e prior densities on the parameters 
^y jk denned by (10). Then a criterion for evaluating graph 

is given by BNRC fcrf cro = £i=i BNRCj£ t4ira , where 



k=l 

n 

- ^2 log Wij + n log(27ra|) 4- n 

+ £{log | A jfc | - M jk log(ncr?)} - log(2a?) 



9i 



+ - 2 ) tog (2™|/n/? ifc ) - log \K jk U 

+riPjklfj k K jk v jk /&]} 7 



with 



Ajfc 
Wj 



= Bf k WjBj k + n/3 jk K jk - 7 {M jk x M jk ), 
= diag(wi i7 w nj ); (n x n) 

= E^i{^-E^^(pH } )} 2 /n. 



Here we approximate the Hessian matrix by 



log 



9i 



s(aj) 2 



3.4. Learning network 



In the Bayesian network literature, it is shown that de- 
termining the optimal network is an NP-hard problem. In 
this paper, we use the greedy hill-climbing algorithm for 
learnign network as follows: 

Stepl: Make the score matrix whose (i, j)-th element is the 
BN0RC£2t cro score of the graph genei -4 gene,-. 
Step2: For each gene, implement one of three procedures 
for an edge: "add", "remove", "reverse", which gives the 
smallest BNRC/> C f Cro - 

Step3: Repeat Step2 until the BNRC/^tero does not reduce. 

Generally, the greedy hill-climbing algorithm has many 
local minima and the result depends on the computational 
order of variables. To avoid this problem, we permute the 
computational order of genes and make many candidate 
learning orders in Step3. Another problem of the learning 
network is that the search space of the parent genes is enor- 
mously wide, when the number of genes is large. Then we 
restrict the set of the candidate parent genes based on the 
score matrix, which is given by Stepl. 
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Figure 3. The smoothed estimates by the various values of the hyper parameters. (a1): The effect 
of hyperparameter p jh in the prior distribution of the coefficients of B-splines. This parameter can 
control the smoothness of the fited curve. (b1) and (d): The effect of hyperparamter Pj in the 
parameter of the error variances. This parameter can capture the heteroscedastisity of the data and 
can reduce the effects of outliers. 



3.5. Hyper parameters 

Consider the nonparametric regression model defined in 
(4). The estimate 0j is a mode of h s (0j\X n ) and depends 
on the hyper parameters. In fact, the hyper parameter plays 
an essential role for estimating the smoothed curve. 

In our model, we construct the nonparametric regres- 
sion model by 20 B-splines. We confirmed that the differ- 
ences of the smoothed estimates against the various num- 
ber of the basis functions cannot be found visually. Be- 
cause when we use a somewhat large number of the basis 
functions, the hyper parameters control the smoothness of 
the fitted curves. Figure 3 (al) shows the scatter plot of 
YGL237C and YEL071W with smoothed estimates for 3 
different values of the hyper parameters. The details of the 
data are shown in later section. Clearly, the smoothed esti- 
mate strongly depends on the values of the hyper parame- 
ters. Figure 3 (a2) is the behavior of the BNRCh etero crite- 
rion of the two genes in Figure 3 (al). We can choose the 



optimal value of the hyper parameter as the minimizer of 
the BNRCfcef ero and the optimal smoothed estimate (solid 
curve in Figure 3 (al)) can capture the structure between 
these genes well. The dashed and dotted curves are near the 
maximum likelihood estimate and the parametric linear fit, 
respectively. 

The effect of the weight constants Wjj , w n j is shown 
in Figure 3 (bl) and (cl). If we use the nonparametric ho- 
moscedastic regression model [22], we obtain the dashed 
curve, which exhibits some spurious waviness due to the 
effect of the data in the upper-left corner (bl). By adjusting 
the hyper parameter pj in (9), the estimated curve results 
in the solid curve. The optimal value of pj is also chosen 
by minimizing the BNRC^ ctero criterion (see Figure 3 (b2) 
and (c2)). Of course, when the smoothed estimate is prop- 
erly obtained, the optimal value of pj tends to zero. 

Finally, we show the algorithm for estimating the 
smoothed curve and optimizing the hyper parameters. 
Stepl : Fix the hyper parameter pj. 
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Step2: Initialize: *y jk ~O y k~ 

Step3: Find the optimal j3 jk by repeating Step3-1 and 
Step3-2 

Step3-1: Compute: 
for fixed p jh . 

Step3-2: Evaluate: Repeat Step3-1 against the candidate 
value of fyk, and choose the optimal value of fak, which 
minimizes the BNRc£2 tero - 

Step4: Convergence: Repeat Step3 for k ~ 
1, 1, ft, 1,~ until a suitable convergence 
criterion is satisfied. 

StepS: Repeat Stepl to Step4 against the candidate value 
of p jy and choose the optimal value of p$ 9 which minimizes 

theBNRC^ero- 

4. Computational Experiments 
4 J. Monte Carlo Simulation 

We use the Monte Carlo simulation method to show the 
effectiveness of our method. The data ware generated from 
the artificial network of Figure 4 (a) with the functional 
structures between nodes as follows: 



X 2 = 

^3 = 

X A = 

X 5 = 

X 6 = 

X 7 = 

X s = 

X 9 = 



Xl + 2 sin(X 5 ) - 2X 7 + e ly e!~ 7\T(0, (4s) 2 ), 
{1 + exp^Xs)}"" 1 + e 2 , ^-^(O,^ 2 ), 
e 3 , ? e 3 ~iV(0,i), 
Xf/3 + e 4 , e 4 ^iV(0 ) (4 S ) 2 ), 

e 6 , e 6 -iV(0,l), 

f -l+e 7 , (*8<-0.5), 

I X B +e 7y (-0.5 < X s < 0.5), 

[ l + e 7 , (0.5 <X 8 ), £ 7 ~W(2s) 2 ), 
exp(-^ 4 - l)/2 + es, e 8 - iNT(0, (2s) 2 ), 
e 9 , iV(0,l), 
cos(JY 9 ) + 6 10 , em - iV(0, (4s) 2 ), 



where 5 is a constant. After transforming the observations 
of the parent variables to mean 0 and variance 1, then the 
observations of the qhild variable are generated. 

We generate 100 observations from this artificial net- 
work and our aim is to rebuild the network in Figure 4 (a) 
from the simulated data. We use two different settings of the 
noise variance, one is s = 0.2 and another is s = 0.1. The 
observations from the setting of the noise s = 0.2 are ex- 
perientially similar to the real microarray data. The Monte 



Carlo simulation was repeated 1000 times and we focused 
on the number of correct estimations. Figure 4 (b) and (c) 
are the results of the Monte Carlo simulations for s — 0.2 
and s = 0.1, respectively. 
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Table 1. The false positives of the Monte Carlo 
simulations. The number attached after node 
name is the number of estimated connec- 
tion without direction information and the per- 
centage is the direction information- For ex- 
ample, in s=0.2, the proposed method esti- 
mated the relationship "X1 -> X4" or "X1 <- 
X4" 15 times from 1000 Monte Carlo simu- 
lations and 87% of 15 times represents the 
direction from left to right ("X1 X4"). 



The results of the Monte Carlo simulations can be sum- 
marized as follows: In the setting of noise variance s — 0.2, 
our model can rebuild the target network very well. Table 
1 shows the false positives of the Monte Carlo simulations 
and we can see the percentages of the false positives are 
almost less than 10%. Since the simulated data is similar 
to the real microarray data in the setting s = 0.2, we can 
expect that our network estimation method can work effec- 
tively in the real data analysis. From Figure 4 (b) and (c) 
and Table 1 , the number of true negatives is much less than 
the number of false positives. We believe that mis tendency 
is preferred in the exploratory data analysis. In the setting 
5 = 0.1, our model can rebuild the target network more pre- 
cisely, and the number, of false positives decrease compared 
with the result of s = 0.2. 
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(a) 00 < c > 



Figure 4 The results of the Monte Carlo simulations, (a) True network, (b) Result for s=0.2. (c) 
Result for s=0 1. The number next to edge represents the number of estimated connections from 
1000 Monte Carlo experiments. The percentage includes the information of the edge direction. For 
example, in s=0.2, the connection between X5 and X1 appeared 958 times from 1000 Monte Carlo 
experiments and those 97% is the correct direction (from X5 to X1). 



4.2. Real Data Analysis 

In this section we show the effectiveness of our proposed 
method through the analysis of Saccharomyces cerevisiae 
gene expression data, which is newly obtained by disrupting 
100 genes. Our research group has installed a systematic ex- 
perimental method, which observes changes in the expres- 
sion levels of genes on a microarray by gene disruption. By 
using this method, we have launched a project whose pur- 
pose is to reveal the gene regulatory networks between the 
5871 genes of Saccharomyces cerevisiae. Many laborato- 
ries have also reported similar projects. We have already 
collected a large number of expression profiles from gene 
disruption experiments to evaluate genetic regulatory net- 
works. Over 400 mutants are stocked and gene expression 
profiles are accumulating. 

We monitored the transcriptional level of 5871 genes 
spotted on a microarray by a scanner. The expression pro- 
files of over 400 disruptants were stored in our database. 
The standard deviation (SD) of the levels of all genes on 
a microarray was evaluated. The value of SD represents 
roughly the experimental error. In our data, we estimated 
the value of 0.5 as the critical point of the accuracy of ex- 
periments. We have evaluated the accuracy of those profiles 
on the base of the standard deviation of the expression ratio 
of all genes. 1 07 disruptants including 68 mutants where the 
transcription factors were disrupted could be selected from 
400 profiles. 

We used 100 microarrays and constructed a genetic net- 
work of 521 genes from the above data. The 94 transcription 
factors whose regulating genes have been clearly identified 
were found. The profiles of the 521 genes in control by 



those 94 factors were selected from 5871 profiles. 



Baslp and Bas2p also activate expression of three genes 
in the histidine biosynthesis pathway. In a gcn4 gack- 
ground, mutations that abolish the BAS1 or BAS2 function 
lead to a histidine auxotrophy. Previous investigation indi- 
cated that Baslp and Bas2p are DNA binding proteins re- 
quired for transcription of HJS4 and these ADE genes like 
GCN4 [8, 1 1 , 29]. In this paper, we made clear that both ge- 
netic relation. Figure 4 indicates that those ADE genes and 
histidine biosynethesis genes are related with BAS1 more 
directly than GCN4. The ribose component of purine ri- 
bonucleotides is derived from ribose 5-P, an inter mediate 
of the pentose phosphate cycle. The atoms of the base moi- 
ety are contributed by many compounds. They are added 
step wise to the preformed ribose. There exist striking in- 
terrelationships with the pathway for histidine synthesis. 

Studies on the regulation of the purine biosiynthesis 
pathway in Saccharomyces cerevisiae revealed that all the 
genes encoding enzymes required for AMP de novo biosyn- 
thesis are repressed at transcriptional level by the presence 
of extracellular purines. ADE genes are transcriptionally ac- 
tivated as well as some histidine biosynthesis genes. Espe- 
cially the fact that expression of HIS4 is related with ADE 
genes were known. In our regulated network, HJS4 were 
related with somv ADE genes closely, and some HIS genes 
are related with ADE genes like HIS4. The biosynthesis of 
the essential amino acid histidine shows in Saccharomyces 
cerevisiae shows close connection to purine metabolism, 
and our result satisfied this fact. 
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phosphoribosy L-AMP cydohydrolase/ 
phosphorlbosy L*ATP pyrophosphate/ 
histidinol dehydrogenase 



(NAPDPHXGOGAT) 



positive trau scriptioa transcription 13 
regulator of AR09 and for poohivc nitrogen 
ARO10 regulation 



Figure 5. The resulting partial network of the analysis of 521 Saccharomyces cerevisiae genes. 



95 



WO 03/027262 



PCT/US02/31093 



5. Conclusion 

In this paper we proposed a new statistical method for 
estimating a genetic network from microarray gene expres- 
sion data by using a Bayesian network and nonparametric 
regression. The key idea of our method is the use of non- 
parametric heteroscedastic regression models for capturing 
nonlinear relationships between genes and heteroscedastic- 
ity of the expression data. If we have a network that repre- 
sents the causal relationship among genes, we can simulate 
the genetic system on the computer, e.g., Genomic Object 
Net [26, 27]. In this stage, it is required that the relation- 
ships between genes are suitably estimated. In this sense, 
the proposed heteroscedastic model can give an essential 
improvement, because the previous models sometimes lead 
to unsuitable estimates of the systems. We consider the sim- 
ulation of biological system as a future work. 

An essential problem for network construction is the 
evaluation of the graph. We investigated this problem as 
a statistical model selection or evaluation problem and de- 
rived the new criterion for selecting graph from Bayes 
approach. Our method covers the previous methods for 
constructing genetic networks by using Bayesian networks 
and improves them in the theoretical and methodological 
senses. The proposed method successfully extracts the ef- 
fective information and we can find these information in 
the resulting genetic network visually. We use the simple 
greedy algorithm for learning network. However, this algo- 
rithm needs much time for determining the optimal graph. 
Hence, the development of a better algorithm is one of the 
important problems and we would like to discuss it in a fu- 
ture paper. 

We showed the effectiveness of our method through the 
Monte Carlo simulations and the analysis of Saccharomyces 
cerevisiae gene expression data and evaluated the resulting 
network by comparing with biological knowledge. We con- 
struct the genetic network without using biological informa- 
tion. Nevertheless, the resulting network includes many im- 
portant connections, which agree with the biological knowl- 
edge. Hence, we expect that our method can demonstrate its 
power in the analysis of a completely unknown system, like 
the human genome. 
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Example 4 

Statistical analysis of a small set of time-ordered 
gene expression data using linear splines 



ABSTRACT 

Motivation: Recently, the temporal response of. genes to 
changes in their environment has been investigated using 
cDNA microarray technology by measuring the gene 
expression levels at a small number of time points. 
Conventional techniques for time series analysis are not 
suitable for such a short series of time-ordered data. The 
analysts of gene expression data has therefore usually 
been limited to a fold-change analysis, instead of a 
systematic statistical approach. 

Methods; We use the maximum likelihood method 
together with Akaike's Information Criterion to fit linear 
splines to a small set of time-ordered gene expression 
data in order to infer statistically meaningful information 
from the measurements. The significance of measured 
gene expression data is assessed using Student's West. 
Results: Previous gene expression measurements of the 
cyanobacterium Synechocystis sp. PCC6803 were 
reanalysed using linear splines. The temporal response 
was identified of many genes that had been missed by a 
fold-change analysis. Based on our statistical analysis, we 
found that about four gene expression measurements or 
more are needed at each time point 
Contact: mdehoon@ims.u-tokyo.ac.jp 

1 INTRODUCTION 

In recent years, many cDNA microarray experiments have 
been performed measuring gene expression levels under 
different conditions. Measured gene expression data have 
become widely available in publicly accessible databases, 
such as the KEGG database (Nakao et aL y 1999). 

In some of these experiments, the steady-state gene 
expression levels are measured under several environmental 
conditions. For instance, the expression levels of the 
cyanobacterium Synechocystis sp. PCC6803 and a mutant 
have been measured at different temperatures, leading to the 
identification of the gene Hik33 as a potential cold sensor in 
this cyanobacterium (Suzuki et al r 2001 ). 

In other experiments, . the temporal pattern of gene 



expression is considered by measuring gene expression 
levels at a limited number of points in time. Gene expression 
levels that vary periodically have for instance been 
measured during the cell cycle of the yeast Saccharomyces 
cerevisiae (Spellman et al, y 1998), The gene expression 
levels of the same yeast species were measured during the 
metabolic shift from fermentation to respiration (DeRisi et 
at* 1997). In this experiment, the environmental conditions 
were changing .slowly over time. Conversely, the gene 
response to an" abruptly changing environment can be 
measured. As an example, the gene expression levels of the 
cyanobacterium Synechocystis sp. PCC 6803 were measured 
at several points in time after a sudden shift from low light 
to high light (Hihara, 2001). 

In cDNA microarray experiments, gene expression levels 
are typically measured at a small number of time points. 
Conventional techniques of time series analysis, such as 
Fourier analysis or autoregressive or moving-average 
modelling, are not suitable for such a small number of data 
points. Instead, the. gene expression data are often analysed 
by clustering techniques or by considering the relative 
change in the gene expression level only. Such a fold-change 
analysis may miss significant changes in gene expression 
levels, while it may inadvertently attribute significance to 
measurements dominated by noise. In addition, a 
fold-change analysis may not be able to identify important 
features in the temporal gene expression response. 

Several techniques to analyse gene expression data, such 
as deriving Boolean or Bayesian networks, have been 
proposed in the past (Liang etal, 1 998; Akutsu et al t 2000; 
Friedman et aL; 2000, Imoto et al, 2002). Whereas 
describing gene interactions in terms of a regulatory 
network is very important, deriving a network model 
requires gene expression data at a large number of time 
points, which is currently often not yet available. It should* 
be noted that the number of genes is on the order of several 
thousands,' while the gene expression levels are often 
measured at only five or ten points in time. 

So far, a systematic method has been lacking to 
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statistically analyse gene expression measurements from a 
small number of time-ordered data. In this paper, we will 
outline a strategy based on fitting linear spline functions to 
time-ordered data using the maximum likelihood method 
and Akaike's Information Criterion (Akaike, 1971). The 
significance of the gene expression measurements is 
assessed by applying Student's f-test. This allows us to infer 
information from gene expression measurements while 
taking the statistical significance of . the data into 
consideration, This kind of analysis should be viewed as a 
first step toward building gene regulatory networks. As an 
example, we reanalysed the gene expression measurements 
of the cyan obacteri urn Synechocystis sp. PCC 6803 (Hihara, 
2001). It is shown that information can be inferred from the 
measured data that is missed when considering the 
fold-change only. By repeating our analysis with a'subset of 
the available data, we were able to determine how many 
measurements are needed at each time point in order to 
estimate the linear spline function reliably. 

2 METHODS 

2.1 Student's Mest 

First, we would like to assess if the measured gene 
expression ratios are significantly different from unity. If for 
a specific gene we can conclude that for all time points the 
measured expression ratios are not significantly different 
from unity, we can eliminate that gene from farther analyses. 
The significance level can be established by applying 
Student's f-test for each time point separately; Since multiple 
comparisons are being made for each gene, the value of the 
significance level <x should be chosen carefully. 

We define Ho (0 as the hypothesis that for a given gene the 
expression ratio is equal to unity at a given time point t h and 
Ho as the hypothesis that for a given gene the expression 
ratios at all time points are equal to unity. If we denote #as 
the significance level for rejection of hypothesis Ho, and a ' 
as the significance level for rejection of hypothesis Ho (,) » 
then cx f and or are related via 



i-«=(i-«')\ 



(i) 



in which a is the number of time points at which the gene 
expression ratio was measured. Note that by expanding the 
right hand side in a first order Taylor series, this equation 
reduces to Bonferoni's method for adjusting significance 
levels (see also Anderson and Finn, 1996). 

By performing Student's Mest at every time point for 
each gene, using a' as the significance level, we will find 
whether Ho° and therefore Ho should be rejected. If Ho is 
not rejected, we can conclude that mat gene is not 
significantly affected by the experimental manipulation, and 
should- therefore not be included in further analyses. If for a 
given gene the null hypothesis Hq is rejected, we conclude 
that that gene was significantly affected by the experimental 
manipulation. 



2.2 Analysing time-ordered data using linear splines 
Next, we analyse the temporal gene expression response for 
genes that were found to be significantly affected. The 
measured gene expression ratios form a small set of 
time-ordered .data, to which we can fit a linear spline 
function. A linear spline function is a continuous function 
consisting of piecewise linear functions, which are 
connected to each other at knots (Friedman and Silverman, 
1989; Higuchi, 1999). Whereas cubic splines are used more 
commonly, for the small number of data points we are 
dealing with linear spline functions are more suitable. A 
conceptual example of a linear spline function with knots 
t * , t\, tj* , t* 4 is shown in Figure I. 

Consider a set of data points (t Jy xj) t j e {1,.,., «}. We 
wish to fit a nonparametric regression model of the form 



(2) 



to these data, in which g is the linear spline function and § is 
an independent random variable with a normal distribution 
with zero mean and variance c*\ 

We estimate the linear spline function, g using the 
maximum likelihood method. The probability distribution of 
one data pointy, given tj, is 



/(4,;,,^)= 7 ^ r e, P 



(3) 



2<x z 



The log-likelihood function for the n data points is then 
given by 

L^h-^n[^}-^-t( Xi - S ( ti )y. (4) 

The maximum likelihood estimate of the variance <? can 
be found by maximising the log- likelihood function with 
respect to cF. This yields 




♦• : measured data point 
X:knot 



'* h h r * Timer 

Fig. 1. A conceptual diagram of a linear spline function fitted to 
measured data. 
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(5) 



The Iog-liTce1ihood Sanction ran then be written in the form 

The maximum likelihood estimate. J of the linear spline 
function g can now be found by minimising c 2 . It can be 
shown that the- minimum value of c 2 will be achieved if 
the linear spline function is chosen such that 



(7) 



in which g is a vector containing the q values of the linear 
spline function at the knots t* > A is a tridiagonal symmetric 
matrix given by 



(8) 



f or 1 < / < 



(9) 
(10) 



f.t,<tj-a M fM~ l iJ 

for l£i<# 



and is a vector given by 



(id 



(12) 



for \<i<q\ 



C13) 
(14) 



The fitted model depends on the number of knots g. The 
number of knots can be chosen using Akaike's Information 
Criterion, known as the A1C (Akaike, 197 1; see also 



Priestley, 1994) 



AlC = -2L{g,o z Y 2fe + l), 



(15) 



where q + 1 is the number of estimated parameters ( c 2 and 
the q entries in the estimated vector J). Substituting the 
estimated log-likelihood function from equation (6), we find 



AIG = n In \l71a 2 ] ^ n -f 2 g -h 2 , 



(16) 



in which c 2 is given by equation (5) after substitution of 
the maximum likelihood estimate g for the linear spline 
function g. 

For each value of q t we will calculate the value of theA/C 
after fitting the linear spline function as described above, 
and select the value of q that yields the minimum value of 
the AJC. The case q - 2 corresponds to linear regression. For 
the special case q ~ l y we are effectively fitting a flat line to 
the data. If we find that for a particular gene, the minimum 
AlC is achieved for the constant function (q = 1), then we 
can conclude that the expression level of that gene was 
unaffected by the experimental manipulations. 

Gene expression data are typically given in term of 
expression ratios. At time zero, the expression ratio is equal 
to unity by definition. This fixed point can be incorporated 
easily in our methodology by modifying equation (7). It can 
be shown that the minimum value of c 2t will now be 
achieved by choosing the linear spline function such that 
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in which 2£=fc 2 - A^and £,=1* 

3 RESULTS 

3,1 Student's /-test 

We will illustrate using Student's f-test and fitting linear 
spline functions by reanalysing the measured gene 
expression profile of the cyanobacterium sp. PCC 6803 after 
a sudden exposure to high light (HL) (Hihara et al 9 2001). 
The expression levels of 3079 OKFs were measured at zero, 
fifteen minutes, one hour, six hours, and fifteen hours both 
for cyanobacteria exposed to HL and cyanobacteria that 
remained in the low light (IX) condition. Table 1 shows the 
number of measurements at each time point Data from the 
cDNA expression measurements were obtained from the 
KEGG database (Nakao et a/., 1999). 
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Table l.The gene expression measurements as a time-ordered set of data. 



Time point 


Number of measurements 


1 = 15 minuies 




i ~ 1 hour 


6 


; = 6 hours 


4 


/ = 15 hours 


4 


(*): Two sets of measurements are missing in the KBGG database at the t - 15 


minutes time point. 




Table Z Student's .-test of gene expression measurements (3079 ORFs total). 


Significance level 


Number of ORFs 


p < 0.0003 


98 


0.0003 <p< 0.001 


69 


0.001 <p< 0.01 


320 


0.01 <p< 0.05 


517 


p>0.05 


2075 



It should be mentioned that the data used for the original 
analysis (Hihara, 2001) may not be identical to the raw data 
submitted to KEGG (Hihara, personal communication). In 
addition , two measurement sets out of six at the t = 15 
minutes time point are missing in the KEGG database. 
Recalculating the gene expression ratios from the raw data 
gives numbers close to the previously published results 
though. ' 

After subtracting the background signal intensities from 
thi" HL and IX raw data, global normalisation was applied 
and the ratio of the HL to the LL signal intensities was 
calculated to find the relative change in the gene expression 
level with respect to the control (LL) condition. In the 
fold-change analysis, a gene was regarded as being affected 
by HL if its expression level changed by a factor of two or 
more. The statistical significance of such changes was 
assessed heuristically by considering the size of the standard 
deviation of the measurements (Hihara, 2001). 

The results of Student's /-test on the gene expression 
ratios of each gene separately are shown in Table 2. At a 
significance level of a — 0.001, 167 genes were found to be 
significantly affected by HL condition. Note that we would 
expect about three type-1 errors among these 167 genes. In 
comparison, 164 ORFs were found to be affected by the HL 
condition in the original analysis (Hihara, 2001). 

By considering the fold-change for the psbD2 gene 
(slr0927)> it was concluded that it was not significantly 
induced by HL (Hihara, 2001). This was remarkable, since 
this gene had been reported to be inducible by HL in the 
cyanobacteriurn Synechococcus sp. PCC 7942 (Bustos and 
Golden, 1992; Anandan and Golden, 1997> However, 
performing Student's Mest on the gene expression data for 
the psbD2 gene at t = 6 hours yields p = 3.3 x 10~ 5 , 
suggesting that this gene was indeed affected by HL. 

3.2 Analysis using linear spline functions 

Next, we fit linear spline functions to the measured gene 
expression ratios. The number of knots q will be between 
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orie and five, with a fixed knot equal to unity at time zero. 
For q = 3 and q = 4 T three possibilities exist for the 
placement of the knots between the linear segments of the 
linear spline. These are indicated in Figure 2, together with 
the cases q - 1, q = 2, and q = 5. Notice that the number of 
possible knot placements increases exponentially with the 
maximum number of knots q^ x as H 2 W ' 2 . 

In the fold-change analysis, the temporal gene expression 
patterns were classified into six categories (Hihara, 2001), 
listed in Table 3. Fitting a linear spline function to the 
measured gene expression data provides a move flexible 
way to describe the data than categorising. In addition, a 
numerical description of the gene expression response 
pattern is an important initial step in deriving a gene 
regulatory network. 
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Fig. 2. Possible placement of knots for a time-ordered set of measurements at 
five time points. 

Table 3. Six categories were used in the fold-change analysis to classify the 
temporal pattern of gene expression (Hihara, 20OJ). 



TypeJ 


Induced within 15 manures, then decreased 


Type 2 


Induced continuously ai high levels 


Type 3 


Induced at approximately one hour 


Type 4 


Repressed within 15 rrnnuies, then increased 


Types 


Repressed continuously at low levels 




Repressed at approximately one hour 
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We will illustrate the usage of the A1C with an example. 
In the fold-change analysis, the threorine synthase gene thrC 
(sU168S) was found to be repressed af approximately one 
hour. The calculated values of the A1C for the different sets 
of knots are listed in Table 4. The minimum AJC was 
achieved for knots at 0, 15 minutes, 1 hour, and 15 hours. 
Figure 3 shows the measured gene expression levels, 
together with the linear spline that was Fitted to the data. 

Performing this procedure for all the jene expression 
measurements lets us classify the different genes based on 
their time-dependent response to HL- Several choices can be 
made as to which genes to include in this analysis. In the 
original analysis, genes were removed from the calculation 
if their expression levels were among the 2000 lowest out of 
3079 ORFs (Hihara, 2001). Alternatively, we can eliminate 
genes if Student's r-test showed that they were not 
significantly affected by HL. Table 5 shows the number of 
genes whose measured expression levels correspond to each 
response pattern for these different cases. For Student's r-test, 
a significance level or = 0.001 was used. 



Table 4. The calculated AJC for the different sets of knots for the ihreorint 
synthase gene thrC {$111688). 



Placement of knots 


Arc 


Flat line 


62J5 


(0, 15 hours) 


64.54 


(0, 15 minutes, 15 hours) 


64.54 


(0, 1 hours, 15 hours) 


66.24 


(0, 6 hours, 15 hours) 


66.38 


(0, 15 minutes, 1 hours, 15 hours) 


25.24 


(0, 15 minutes, 6 hours, 15 hours) 


64.69 


<0, 1 hour, 6 hours, 15 hours) 


68.23 


<0, 15 minutes, 1 hour, 6 hours, 15 hours) 


26.45 




Fig. 3. The measured gene expression levels for the threorine synthase gene thrC 
{sill 688}, together with the fitted linear spline. 



Table 5. The number of genes for each response partem. 



Placement of biots Number of genes 





Keeping 


Removing 


Using r-test to 




all genes 


lowest 2000 


remove genes 


Hat line 


927 


249 


1 C) 


(0, 15 hours) 


280 


69 


3 


(0, 15 minutes, 15 hours) 


663 


217 


61 


(0, 1 hours, 15 hours) 


336 


178 


7 


(0, 6 hours, 1 5 hours) 


160 


64 


3 


(0, 15 minutes, 1 hours. 


296 


119 


27 


1 5 hours) 








(£>, 15 ralnmes^ 6 hours, 


173 


83 


39 


15 hours) 








(0, 1 hour, 6 hours, 15 


103 


49 


12 


hours) 








(0, 15 minutes, 1 hour, 6 


86 


51 


14 


hours. 15 hours) 









(*): Removed because of the presence of an outlier. 



We can compare the 167 genes which were identified by 
Student's /-test as significantly affected by HL with the 
results from the fold-change analysis (Hhara, 2001),- in 
which 164 ORFs were identified. First, we remove those 
genes from our analysis for which an outlier was present in 
the data. We define an outlier as a data point that deviates 
more than two standard deviations from the mean of the data 
at a given time point Only^one gene was found for which 
the measured expression data contained an outlier; the linear 
spline function fitted to its expression data was a flat line. 
None of the other gene expression levels was described by a 
flat line, which is consistent with the results from Student's 
/-test. 

Next, we remove those genes whose expression level was 
among the lowest 2000 in order to avoid using data that are 
predominantly determined by noise. The same procedure 
had been used for the fold-change analysis (Hihara, 2001). 
After removal of these genes, 107 genes remained that were 
significantly affected by HL. 

Of the 107 genes, 42 had not been identified m the 
fold-change analysts (Hihara, 2001). These genes are listed 
in Table 6, together with the location of the knots that was 
found for each gene. For each linear spline function the 
percentage variance explained was calculated as a measure 
of the goodness of fit. As an example, Figure 4 shows the 
measured gene expression ratio as well as the fitted linear 
spline function of the gene xylR {slr0329\ having four knots 
at zero, fifteen minutes, one hour, and fifteen hours. Of the 
42 genes, the gene xylR (slr€329) had the largest percentage 
variance explained (98.7%). 

Of the 164 ORFs that were identified in the fold-change 
analysis, 39 were not significantly affected by HL according 
to Student's r-test at a significance level a = 0.00 1. These 
ORFs are listed in Table 7. 
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Table 6. ORFs that were significantly affected by HL,- but had not been 
identified in the fold-change analysis (Hihara, 2001}. 



Table 7. ORFs that wexe identified to be affected by HL in the fold-change 
analysis (Hihara, 2001), but found not significant by Student's /-test 





Gene 




% variance 
explained 


SI11330 




{0, 13 nun., 15 hours) 


91 .8 


sI11797 


ycfZl 


(0, 15 min-, 1 5 hours) 




slr0121 




(0, 15 mm., 15 hours) 


78.7 


slr0343 


petD 


(0 r 15 min., 1 5 hours) 


935 


slr0442 


(0. 15 min., 15 hours) 


81.! 


slr0769 




(0: 15 rain.. If sours) 


64.7 


Slr0906 


psbB 


(0, 15 rrrin., iz hours) 


94.6 


sir) 128 




(0, 15 rnio., 15 hours) 


63.7 


slrll52 




(0, 15 mia, 15 hours) 


60.7 


slrl277 


gspD 


(0, 15 rrrin., 15 hours) 


40.5 


Slrl610 




(0, 15 min., 15 hours) 


70.2 


slrl813 




(0, 15 mia, 15 hours) ' 


843 


5 lr2052 




(0, 15 min., 15 hours) 


87.1 


S110144 


pyrHor 
srobA. 


(0, 1 hour, 15 hours) 


88.5 


S11I327 


atpC 


(6, 1 hour, 15 hours) 


703- 


siU496 




(0, 1 how, 15 hours) 


77.4 


slrl367 


ElgP or 


(0, 1 hour, 15 hours) 


88.2 






([0, 1 hour, 15 hours) 




slrl793 


taiB 


87.0 


sill 878 




(0, 6 hours, 15 hours) 


65.8 


slrl600 




(0,6 hours. 15 hours) 


46.6 


SU0851 


psbC 


(0, 15 min., 1 hour, 15. hours) 


91,9 


S1U471 


cpcG 


(0, 15 rninJ, 1 hour, 15 hours) 


96.6 


SI11812 


tps5 


(O, 35 min., I hour, 15 hours) 


603 


slr0076 




(0, 15 mm., 1 hour, 15 hours) . 


85.8 


slr0329 


xylR 


(0, 15 min., I hour, 15 hours) 


98.7 


Slr0927 


psW2 


(0, 15 min.. 1 hour, 15 hours) 


48.9 


5M513 




(0, 15 rniu., 1 hour, 15 hours) 


84,8 


S110547 




(0, 15 min., 6 hours. 15 hours) 


793 


. S1I1528 . 




(□, 15 min., 6 hours, 15 hours) 


905 


slr0056 


C4 


(0, 15 min., 6 hours, 15 hours) 


76.6 


slr0161 


pilT 


(0, 15 min., 6 hours, 15 houjs) 


96.2 


slrl239 


pntA 


(0, 15 min-, 6 hours, 15 hours) 


96.7 


slrl545 


(0, 15 rnin., 6 hours, 15 hours) 


96.7 


sir] 799 




(0, 15 min., 6 hours. 15 hours) 


955 


SI10617 


lrn30 


(0, 1 hour, 6 hours, 1 5 hours) 


84.9 


SH0921 




(0, 1 hour, 6 hours, 15 hours) 


46.9 


S1I0985 




(0, 1 hour, 6 hours, 15 hours) 


89.2 


slrl237 


cod A 


(0. 1 hour, 6 hours, 15 hours) 


80.7 


siw 2 os 


arorl 


(0, 15 min. I hoar/ 6 hairs. 15 hours) 


84.4 


S110185 




(0, 15 min, 1 hour, 6 hours, 15 hours) 


94.9 


S111873 




(0, 15 irrin., I hour, 6 hours, 15 hours) 


92.0 


slr0116 


beinD 


(0, 35 nrirL, 1 hour. 6 hours, 15 hours) 


89.2 



1.2 



. 1.0 - 

© 

e as - 




Fig. 4. The measured gene expression levels for the gene xylR (slr0329\ 
together with the fitted linear spline. This gene was considered to be umffected 
by HL in the fold-change analysis (Hihara, 2001 ). 



slrl3ll 


sill 867 


s»1732 


si] 1733 


si] 3734 


slrl280 


S1J0170 


S110430 


S111514 


slrl641 


slrl350 


- slrl516 


slr0228 


sir! 604 


slrI675 




s!11483 


S1U541 


SII20I2 


sir0394 


slr0426 


slrl351 


slrl594 


S110217 


s!10218 


£110219 


sll0528 


SII066S 


SU0&14 


S1J0S46 


sill 884 


slr0007 


slr0476 


slr0740 


slr0959 


shl544 


slr!674 


sir 1686 


slrl963 



Table 8. Reliability of the linear spline fitting procedure as a function of number 
of data used. Only those genes were considered which were significantly 
affected by HL according to Student's Mest, and whose data did not contain any 
outliers. 



Using four data points at 15 min., 6 hours, 


linear spline functions are 


and 15 hours, and six data points at 1 hour 


estimated for 166 genes 


Using four data points at each time point 


25 ± 8 estimated differently. 


Using three data points at each time point 


53 ± 10 estimated differently 


Using two data points at each time point 


79 ± 17 estimated differently 



Finally, we would like to establish whether the number of 
measurements at each time point was sufficient to. reliably 
determine the placements of knots for the linear spline 
function. To do so, we repeated the estimation of the linear 
spline function using subsets of the measured data. We then 
counted for how many genes the estimated knot positions 
changed if a subset of the data was used instead of the 
complete set of data. The average and standard deviation of 
this number for four, three, and two data points at each time 
point is shown in Table 8, 

Even if only two data points (at the one how time point) 
are removed, and four data points are used at each time point, 
in 15% of the cases the estimated knot positions change. 
This suggests that four or more data points are needed for 
each time point to reliably deduce information from gene 
expression measurements. 

4 DISCUSSION 

We have described a strategy based on maximum likelihood 
methods to analyse a set of time-ordered measurements. By 
applying Student's /-test to the measured gene expression 
data, we first establish which of the measured genes are 
significantly affected by the experimental manipulation. TTie 
expression responses of those genes are then described by 
fitting a linear spline function. The number of knots to be 
used for the linear spline function is determined using 
Akaike's Information Criterion (A/Q. 

Using linear spline functions allows us more flexibility in 
describing the measured gene expression than using a 
nominal classification. Also, in order to set up a gene 
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regulatory network, it is important that the gene response as 
determined from gene expression measurements is available 
in a numerical form. Finally, the positions of the knots 
specify those time points at which the expression of a gene 
changes markedly, which is important in identifying its 
biological function. As a next step, the classification of gene 
expression responses based on the position of the knots can 
be refined by creating subcategories that take the magnitude 
of the linear spline function at the knots into account. For 
instance, for linear spline functions with three knots we may 
consider creating six subcategories in which changes in the 
gene expression level are described by (flat, increasing), 
(flat, decreasing), (increasing, flat), (decreasing, flat) 
(increasing, decreasing), or (decreasing, increasing). 

Applying the technique of linear spline functions to 
measured gene expression data, we were able to identify the 
temporal expression response pattern of genes that were 
significantly affected by the experimental manipulations. 
The response of 42 of those genes was not noticed in earlier 
fold-change analyses of expression data- Furthermore, it was. 
shown that the expression response levels found in a 
fold-change analysis were not found to be significant by 
Student's r-test for 33 out of 164 genes some genes. 

Gene expression data tend to be noisy and are often 
plagued by outliers. "Whereas Student's /-test and maximum 
likelihood methods described here take the statistical 
significance of noisy data into account, the issue of outliers 
needs to be addressed separately. As a simple procedure to 
remove outliers, we calculated the mean and standard 
deviation of the data for each point in time, and removed 
those data that deviate more than two standard deviations or 
so from the mean. 

Finally, the number of expression measurements needed 
at each time point to reliably fit a linear spline function was 
determined by removing some data points and fitting a 
linear spline function anew. It was found that if four data 
points per time point are used, in about 15% of the cases the 
knot positions will not be estimated reliably. It is therefore 
advisable to make more than four measurements per time 
point 
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Example 5 



USE OF GENE NETWORKS FOR IDENTIFYING AND 
VALIDATING DRUG TARGETS 



We propose a new methodology for Identifying and validating drug targets by using 
gene networks, which are estimated from cDNA microarray gene expression data. 
We created novel gene disruption and drug response rnicroarray gene expression 
data libraries for the purpose of drug target elucidation* We use two types of 
rnicroarray gene expression data for estimating gene networks and then identifying 
drug; targets. The estimated gene networks play an essential roJe in understanding 
drug response data and this information is unattainable from clustering analysis 
methods, which are the standard for gene expression analysis. . In the construction 
of gene networks, we use both Boolean and Bayeaian network models for different 
steps rn Analysis and capitalize on their relative strengths- We use an actual 
example from analysis of the Saccharornyizt* Cere virtue gene expression aod drug 
response data to suggest a, concrete strategy for the application of gene network 
information to drug discovery. 



1 Introduction 

In recent years, the development of rnicroarray technology has produced a 
large volume of gene expression data under the various experimental con- 
ditions. Constructing gene network from gene expression data has received 
considerable attention and several methodologies for gene network inference 
have been proposed in the fields of computer science and statistics, e.g., see 
the paperf A ? A *> 11 > 13 » J4 » l6 ' aw * in the References section. The development of 
new methods for constructing a gene network ia import and should progress. 
On the other hand, real world application of estimated gene networks to solve 
medical or biological problems has recently become the center of interest. In 
this paper, we introduce a methodology for identifying and validating drug 
targets based on gene expression data. 

Clustering mcthod^* 7 *^ 3 have become a standard tool for analyzing mtr 
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croarray gene expression data. However^ they cannot offer information suffi- 
cient for identifying drug targets in either a theoretical or practical sense. In 
this paper, we show how estimated gene networks can be used as a key for 
identifying and validating target genes for the understanding and development 
of new therapeutics. Gene regulation pathway information is essential for our 
purpose, and we use both Boolean and Bayesian networks as the methods 
for inferring gene networks from gene expression profiles. The procedure for 
identifying drug targets can be separated into two parts. At first, we should 
identify the drug-affected genes. Second, we search for the druggable genes, 
which are usually upstream of the drug-affected genes in the gene network. The 
Boolean network model shows its most useful for identifying the drug-affected 
genes by using "virtual gene" technique, proposed herein, and the Bayesian 
network model can work effectively for exploring the druggable gene targets 
related to the elucidated affected genes. We apply the proposed method to 
novel Saccharvmyces ccremstae gene expression data, comprised of epepression 
experiments from 220 gene disruptions and several dose and time responses 
to a drug. We demonstrate a concrete strategy for identifying genes for drug 
targeting through this application study. 

2 Gene [Networks for Id entity rag Drug Targets 
2.1 Clustering Methods 

Clustering methods such as the hierarchical cuxsterini? and the self-organizing 
maps^ 3 are commonly used as a standard tool for gene expression data analy- 
sis in the field of Bioinformatics. Eieen 5 focuses on the hierarchical clustering 
and provides software, Cluster /Tr ee Ve w , for clustering analysis of gene expres- 
sion data. De Hoon et a/?' 8 improves this software especially in the fc-means 
clustering algorithm. These clustering algorithms can work effectively in the 
early stage of analysis. In the clustering analysis of gene expression data, we 
assume that the genes, which have similar expression patterns) play the same 
functional role in cell. Therefore, we often use the clustering methods for 
predicting gene functions. 

We agree with the usefulness of the clustering methods, however, they 
cannot provide sufficient information for our purpose. Clustering methods only 
provide the information on gene groups via the smrilarity of the expression 
patterns. In fact, we need additional hierarchical pathway information, not 
only cluster information, in order to detect drug targets that are affected by 
an agent. In Section 3 7 we show the limitations of clustering techniques for 
drug targeting purposes through real data analysis. 
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2.2 Network Methods 



We use two methods for estimating gene networks from gene expression data. 
Xn this subsection, we give the brief introductions of both methods. For detailed 
discussions of the algorithms, please refer to the papers l » 2 > 3 '** a > 11 * 13 ' 14,1 *' 20 » 22 in 
the References, section. 

Boolean Network 

A number of groups 1 ,2,3 > 4,16j22 have proposed Boolean network models as a gene 
network construction method. To estimate a Boolean network model, we must 
discretize the gene expression values into two levels, 0 (not-expressed) and 1 
(expressed). Suppose that Ui,„./tt* are input nodes of a node v. The state of 
v is determined by a Boolean function /*, C0O«i)>— » ^( u *))> where i>(ut) is the 
state or the expression pattern of If we have time series gene expression 
data, the state depends on time t and the state of node at time t depends on 
the states of inputs at time ( — 1. On the other hand, suppose that we have 
gene expression data obtained by gene disruptions. Akutsu et at} proposed 
the theory and methodology for estimating a Boolean network model without 
time delay. Maki et al± s provides a system, named AIGNET, for estimating a 
gene network based on the Boolean network and an S-system? 1 . In this paper, 
we use the AIGNET system for estimating Boolean network models. 

The advantages of the use of the Boolean network model are as follows: 
a) This model ia very simple and can be easily understood, b) the Boolean 
network model can detect the parent-child relations correctly, when the data 
has sufficient accuracy and information and c) the estimated Boolean network 
model can be directly applied to the Genomic Object Netf 0 * 18 ' 11 *, a software tool 
for biopathway simulation. The negative side of the Boolean method is that 
we must discretize the expression values into two levels, and the quantization 
probably causes information loss. Moreover, the threshold for discretizing is a 
parameter and should be chosen by a suitable criterion- 

Bayesian Network 

The Bayesian network 5 is a graph representation of the complex relations of 
a large number of random variables. We consider the directed acyclic graph 
with Markov relations of nodes in the context of Bayesian network. We can 
thus describe complex phenomena through conditional probabilities instead 
of the joint probability of the random variables. That is, suppose that wc 
have a gene expression data x# of i-th axray^ and 7-th gene for i = 1,...,7* 
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and j = 1, ...,p, we have f{x& , x ip ) » A(»nb»i) x *• • x / p (xt,|p^), where 
p {J sb (pjf, is a gj-- dimensional parent observation vector of ar#. 

Friedman et ol? proposed an interesting approach for estimating a gene 
network from gene expression profiles. They discretized the expression val- 
ues into three values and used the multinomial distributions as the condi- 
tional distributions of the Bayesian network. However, this did not solve tbe 
problem of choosing threshold values for chscretteing. Imoto et aZ* 3 » 14 re- 
cently used the nonp&rametric regression model that offers a solution that 

does not require quantization x*j == mjttPii) + H + ^>7> where 

dj depends independently and normally on mean 0 and variance <t\* and 
rrijje. (-) is a smooth function constructed by ^-splines of the form «V*(pjjp) = 
TlHi 7&*%L(P&) for k = 1,...,©. Here 7 $ -y^* arc unknown co- 

efficients and aO are ^-splines- Lnoto et a/? 3 * 14 proposed the 

nonlinear Bayesian network model of the form 

/(*,; 0) = n «-Pl- W - f: £ 7S^l(pH } )} 2 /2cr|]/ V^^> 

together with new criterion, named BNRC, for selecting an optimal graph. The 
BNRC is defined as the approximation of the posterior probability of the graph 
by using the Laplace approximation for integrals. Jhioto et aZ. 13>14 applied the 
proposed method to the Sacchnromyces cerevi$iac gene expression data and 
estimated a gene network. The advantages of this method are as follows^ a) 
We can analyze the microarray data as a continuous data set, b) this model 
can detect not only linear structures but also nonlinear dependencies between 
genes and c) the proposed criterion can optimize the parameters in the model 
and the structure of the network automatically. 

The Bayesian network has theoretical advantageous base on the mathe- 
matics of inference and in this paper, we use the method proposed by Lnoto 
et af- 3,X4 for constructing Bayesian networks. Unfortunately, the Bayesian 
network cannot construct cyclic regulations and are not useful for creating 
multilevel directional models of regulatory effects from data created from log- 
ical joins of expression data from disruptants and drug response experiments. 
However, we can get around this problem by combining the Bayesian and 
Boolean networks in analysis- Hence, the Boolean and Bayesian networks used 
together can cover the shortcomings one another and we can obtain more re- 
liable information by using both network methods. 
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3 Application 



$.1 Microarray Data 

We created two libraries of microarray data from the Saccharomyces ccrvvisiae 
gene expression profiles. One was obtained by disrupting 120 genes, and the 
another was comprised of the responses to an oral antifungal agent (four con- 
centrations and five time points). We selected 735 genes from the yeast genome 
for identifying drug targets. In YPD, 314 genes were defined as "Transcription 
factors", and 98 of these have already been studied for their control mecha- 
nisms. The expression profile data for 735 genes chosen for analysis included 
the genes .controlled by these 98 "Transcription factors 7 ' from 5871 gene Species 
measured in addition to nuclear receptor genes which have a pivitol role in gene 
expression regulation and are popular drug targets. We constructed the gene 
network "models from the data set of 735 genes over 120 gene disruption con- 
ditions. The details of the disruption data are also described in Imoto et aZ? 4 . 

As for the drug response microarray gene expression data, we incubated 
yeast culures in dosages of 10, 25, 50 and 100 mg of an antifungal medication 
in culture and took aliquota of the culture at 5 time points (0, 15, 30, 45 and 
60 minutes) after addition of the agent- Here time 0 means the start point 
of this observation and just after exposure to the drug. We then extracted 
the total UNA from these experiments, labeled the KNA with cy5> hybridized 
them with cy3 labeled UNA from non-treated cells and applied them to full 
genome cDNA microarray s thus creating a data set of 20 microarrays for drug 
response data. In this paper, we use these 140 microarrays to elucidate drug 
targets using gene networks. 

3.2 Result 
Clustering Result * 

In the identification of the drug targets, the popular but problematic strategy is 
the use of clustering analysi^ 2 » x7 often even with a library of base perturbation 
control data, to compare to drug response data based on the correlations. We 
have two types of microarray data, gene disruption and drug response, allow- 
ing us to compare drug response patterns to gene expression patterns caused 
by disruption. In the clustering analysis, if there is significant and strong 
similarity between the expression patterns of a single disruptant or group of 
disruptants and a given drug response microarray, we may conclude that an 
agent probably plays the same role as the disrupted gcne(s)» Moreover, if this 
disrupted gene has known functional role, we may obtain more information 
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about the response to a drug. 

Unfortunately, as 18 often the case "with such experiments we cannot gain 
such a straightforward result from clustering our data. Figure 1 shows the 
image representation of the correlation matrix of our microarray data. We - 
combine the two types of data and make the matrix Z — (X ; Y)> where X 
and y are the drug response and the gene disruption microarray data matrix, 
respectively. Here, each column denotes an expression pattern obtained by one 
microarray, and each column is standardized with a mean of 0 and variance of 1- 
Therefore, Figure 1 is the information of the correlation matrix R — Z T Z]p> 
where p is the number of genes and is 735. 

In Fjgure 1, the light and dark colors denote the positive and negative high 
correlations, respectively. It is clear that the drug response microarrays have 
high correlation amongst each other and a low correlation with any of the gene 
disruption microarrays. Under such a situation, we cannot find any interactions 
between gene disruptions and drug responses from the clustering analysis that 
would be useful to identify the strongly affected genes that ate meaningful in 
the context of drug response. We further implemented hierarchical clustering 
of the gene disruption and drug response microarrays, but the drug response 
microarrays produce one cluster and we cannot extract any more information 
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on what the actual drug targets from this analysis. This result is essentially 
unchanged when we use other distance measurements or clustering techniques. 
Hence, we cannot obtain the information for identifying the meaningful drug 
targets from the clustering methods with our data. 

Boolean Network Result 

We estimated gene network hy using the microarray data, Z y which was created 
by the combining gene disruption and drug response microarray 3. We introduce 
a method called virtual gtnc technique. We consider the conditions of the drug 
responses data as "virtual genes", e.g., the condition lOOmg/ml and 30min is 
given an assignment as the gene YEXPlOOmg30min. By using the* Boolean 
network model, we can find the child genes of these virtual genes with the 
drug affecting these child genes in progeny generational order. We focus on 
genes which have five or more virtual genes as the parent genes, as the putative 
drug-affected genes, that is, genes which are under direct influence of the virtual 
genes (drug affect). However, a gene which has only one virtual gene as its 
parent gene, may be the primary drug-affected gene, depending on the mode of 
action for a given agent and this must be analyzed on a case by case basis. The 
virtual gene technique highlights the advantages of Boolean network models 
compared with the Bayesian network model in the initial screening for genes 
under drug-induced expression influence. 

• In addition, fold-change analysis can provide similar information to the 
proposed virtual gene technique. In fact, we can obtain the affected genes under 
certain experimental condition by fold-change analysis. However, our virtual 
gene technique can improve the result of the fold-change analysis. Suppose that 
we find gene A and B are affected by the drag from fold-change analysis. The 
fold-change analysis cannot take into account the baseline interactions between 
gene A and gcneB. Than is, if there is a regulation pathway between gene A and 
geneB that gene A —» geneB, the geneB is probably not affected by the drug 
directly. The virtual gene technique can take into account such interaction by 
Vising the information of the gene disruption data and thus reduce the search 
set to more probable target genes given available interaction data. 

There is not guarantee that genes that are most affected by an agent are 
the genes that were "drugged" by the agent, nor is there any guarantee that 
the drugged target represents the most biologically available and advantageous 
molecular target for intervention with new drugs. Thus, even after identifying 
probable molecular modes of action, we should find the most druggable genes 
upstream of the drug-affected genes in a regulatory network and to then screen 
low molecular weight compounds for drug activity on those targets. In the 
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figure 2: The down&trwam pathway of the virtual pene TEXP100mg30jniai. 



estimated Boolean netv^ork, the virtual genes should be placed on the top of 
the network. Therefore, it is difficult or sometimes impossible to find upstream 
information for drug-affected genes in this estimated Boolean network. At this 
stage, we can use the Bayesian network model for exploring the upstream 
region of the drug-affected genes an an effective manner/ 

Bayesian Network Result 

The gene network is estimated by the Bayesian network and nonparametric 
regression method together with BNB.C optimization strategy 13 ' 14 - "We use 
the Saccharomyccs ccrevidiae microarray gene expression data obtained by 
disrupting 120 gen en. From the Boolean network analysis, we can find the 
candidate set of the drag-affected genes effectively. The druggable genes are the 
drug targets related to these drug-affected targets, which we want to identify 
for the development of novel leads. We can explore the draggable genes on the 
upstream region of the drug-affected genes in the estimated gene network by 
Bayesian network method. Using the Bayesian models of network regulatory 
data available to us from our knockout expression library wc are able to search 
upstream of drug affected target for genes, which have a known regulatory 
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control relationship over drug affected target expression. Here, wc focus on 
the nuclear receptor genes as the draggable genes because a) Nuclear receptor 
proteins are known to be useful drug targets and together represent over 20% 
of the targets for medications presently in the market, b) Nuclear receptors 
are involved in the transcription regulatory affects that are directly measured 
in cjDNA microarray experiments. 

Figure 3 shows a partial network, which includes the drug-affected genes 
(Bottom), the druggable genes (Top) and the intermediary genes (Middle)* 
Of course, we can find more pathways from the draggable genes to the drug- 
affected genes if we admit more intermediary genes. Due to the use of the 
Bayesian network model, we can find the intensities of the edges and can select 
more reliable pathway. This is an advantage of the Bayesian network model 
in searching for ideal druggable targets. In Figure 3, the draggable genes in 
the circle connect directly to the drug-affected genes and the other druggable 
genes have one intermediary gene per one druggable gene. From Figure 3, we 
can find the druggable genes for each drug-affected gene, e.g., we can find the 
draggable genes for MAL33 and CDC6 shown in Table 1- 
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Table 1* TJbe draggable genes of MAL33 and CDC 8. "Parents" means these genes connected 
directly to the drug-affected gene. "Grandparents" means there is one intermediary gene 
between these genes and the drug-affected gene. 



Drug-afTectc<3 MAL33 (YBR2S7W) : Maltose f*rm«itsitio7i regulatory pro torn 



Orrr^guble Parent* 

HSP82 (YP&240C) r Heat ahocJc protein 

SRIM (YBR022W) DNA-directed RNA polymsrw II holoenzyme and 
Kornbfcrg** mccj^tor ($RB) subcornpt^ac aubvnit 

BARl (YIL015W) r Bnrric-rp^pwrn precursor 

GPA1 (YHRO05C) i CTP-bindinK protcio fUpha dub unit of the pberoraone 
KAR2 (YJL034W) s unclear fusion protein 



Drug-afTcct^d . CPC6 (YJJL19AW) t Cell division control protein 



Drogg;abIo Parents 

ARP7 0YPRO34W) r Component of SWl-SNF gJobal trauoorlption activntor 

complex and RSC chromatin reinodeBns complex 
BARl (YXbOlSW) *» Barriwrpepino precursor 



GAL II (YOL051W) : DNA-directed RNA polymerase li holo enzyme and 

Kornbcr^fl mediator (SRJ3) snbcomplwit ou burnt 
FAR1 (yiUSVC) : Cy*lin-depend«nt kinase inhibitor (OKI) 
SLA2 (YNX243W) : Cytookelcton n-oscmbly Control protean 



4 Discussion 

In this paper, wc propose a new strategy for identifying and validating drug 
targets using computational models of gene networks. We showed that the clus- 
tering methods cannot provide the sufficient information and we that there is a 
need for the Mnd of hierarchical interaction data provided by network methods. 
We focus on the Boolean and Bayesian networks for estimating gene networks 
from microarray gene expression data. Theses two methods were originally 
proposed from the different points of view, however, both methods have rela- 
tive strengths and weakness and we can obtain more reliable information from 
employing both methods in our analysis. The Boolean network is ideally used 
for identifying the drug-affected genes by using the virtual gene technique set 
forth above. We also described the theoretical advantages of the virtual gene 
technique over results obtained from simple fold-change analysis. This is a 
positive point for using the simpler Boolean networks for certain purposes: 
However, we -show that the Boolean network cannot give the valuable informa- 
tion when we explore the draggable genes upstream of the drug-affected genes 
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in the estimated network. Wc solve this difficulty by applying the Bayesian 
network model to this problem. The Bayesian network model can provide the 
information of the upstream region of the drag- affected genes effectively and 
we can thus attain a set of candidate draggable genes for each drug-affected 
gene. The strategy proposed by this paper is established based on the so- 
phisticated use of a combination of two network methods. The strength of 
each network method can be clearly seen in this strategy and the proposed 
integrated method can provide a methodological foundation for the practi- 
cal application of bioinformatics techniques for gene network inference in the 
identification and validation of drug targets. 
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Example 6 

System for Discovering Network Relationships, and Uses Therof 

1) Objects 

Determining network relationships is highly desirable in numerous 
disciplines, including drug discovery, genomics, proteomics, computational networks, 
human networks, and other systems in which elements interact with each other in 
complex fashions. 

One of the most important points in drug development is to explore action 
sites of lead compounds. Screening of the drug involves the processes of 1) synthesis 
of the lead compounds and screening thereof, 2) retrieval of the action sites and direct 
approach to diseases, 3) improvement to more effective drugs and 4) improvement to 
the drugs having less side effects. The important technology not available today in 
the process of the drug development is an assay assist system taking advantage of 
genomes. It is currently very difficult to construct an assist system using human 
genomes, although the human genome has been already elucidated. Underlying 
difficulties are based on the fact that construction of models using 
knock-in/knock-out genes is very difficult, and the models should be established for 
respective organs since each organ has its own specific expression. 

Under these situations, it is considered to be an effective system for most 
effectively developing drugs having less side effects to determine the target site of the 
drug using yeast as an eucaryotic microorganism that may be easily modeled and for 
which abundant data have been accumulated, and to construct a system for 
extracting features of the network concerning the target site. Practically, differences 
of expression among the genes are accumulated using the knock-in/knock-out effect 
of each gene, and an expression control network is constructed from the data. Then, 
a lead compound is added to the culture medium of yeast utilizing this network, and 
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any action sites of the lead compound are estimated by comparing the difference of 
gene expression with accumulated data. GNI has developed a genome assay assist 
system to provide an analysis environment and services for assisting pharmaceutical 
companies with more accurate and efficient development of drugs. 

2) Technologies for Constructing Network 

As an achievement of the genome project that is considered to be one of the 
greatest molecular biology projects in the 20 th century, genome structures of many 
model animals have been elucidated, and analysis of the structure of even human 
genomes are now under way. Elucidation of genome functions are made possible by 
this rapid progress of the genome structure analysis that have been impossible by 
conventional biochemical and genetic methods. The most advanced results 
comprises transcriptome for investigating expression of total genes, and proteome for 
investigating expression of total proteins. In addition to these results, it has been 
practically available to elucidate the overall cell functions through investigation of the 
expression network of all genes constituting an organism by analyzing these data. 
Based on these facts, the next project for determining the genome structure is to 
elucidate genome functions involved in each organism. Studies for controlling overall 
functions of an organism have started by classifying and systematizing each genome 
depending on its nature. Consequently, a new era is now beginning when the 
observed data should be finally combined and connected to elucidation of the genome 
functions. - Under these situations, it is almost impossible to collect and analyze 
multiple lines of information using conventional methods by which a single gene is 
analyzed. 

A novel analysis method is required for attaining the object as hitherto 
described. In contrast to the conventional model constructed by experimentally 
cormrming respective relations, it should be considered to totally construct a model 
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from these large scale data. For example, a gene expression control network can be 
constructed by analyzing expression profiles and time course of total genes of an 
organism under various culture conditions, or by analyzing expression profiles in a 
mutant whose genes are destroyed. A study as shown in Fig. 1 is possible from a 
systematic point of view, and studies for assuming internal structure of the system 
from the data on input and output, and disturbance has became possible. When 
disturbance is considered to be destruction of a gene (no expression of the gene), it is a 
problem from the systematic point of view whether the gene expression system can be 
conjectured assuming that the input means expression of normal total genes and the 
output means expression of total genes when a part of the gene is destroyed. 

Identification of System from Systematic Point of View 

It is possible to collect experimental data for enabling the network to be 
assumed when the genome structure is elucidated. 

-Data of all the related genes can be measured. 

-Many tools for enabling many experiments to be performed - such as 

chips have been developed. 

When an output is measured by applying a disturbance to obtain many 
standardized data, then: 
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Is assumption of internal structure possible? 



FIG. 1 Molecular Network 

The first step that enabled this expression profile to be analyzed is 
emergence of micro-arrays 1 & that enables variation of many kinds of DNA or RNA 
(expression profile) to be simultaneously measured, or GeneChip method 3 ). "Nature 
Genetics, January, 1999" have issued a special edition 4 ) on the gene chips. This 
method is now beginning to be widely used in the biological and medical fields, and 
many venture enterprises have been established worldwide to start sales of total 
system involving the chips and trays (see, for example, www.inmcite.com and 
www.afiimetrix.com). DNA Chip Laboratories Co., Takara Brewery Co. and Hokkaido 
System Science Co. have also joined in manufacture of the chip in Japan. Since a 
vast amount of experimental results are expected to be obtained from these methods, 
novel hypotheses and analysis methods will be required. Such methods would enable 
to replace conventional models that have been constructed by experimentally 
confirming individual relations with novel models that may be constructed from large 
scale data at one stroke. For example, it is possible to construct a control network of 
gene expression by analyzing the expression profile of entire genes under a variety of 
environments. Search with comprehensive methods enables cause genes of diseases 
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to be explored and to construct a disease model 

Analyzing gene expression networks 

(1) Basic array technology 

The same principle as the conventional Southern blot method is basically 
used in the array experiments. Technical developments that have brought about a 
rapid advance in the array technology include development of measuring instruments 
by which quantitative ratio between the sample and control is precisely measured by 
binding DNAs in high density on a slide glass having good optical characteristics. 
Recently, 10,000 or more of cDNAs were successfully spotted on a slide glass using an 
instrument (an array er or a spotter) developed by Dr. P. O. Brown et al 1 ) of Stanford 
University in USA. Expression of the gene could be measured by hybridizing sample 
and reference DNAs labeled with two kinds or more of fluorescent substances with the 
cDNAs on the slide. In a different method, a GeneChip has been manufactured 
using a more industrially available photolithographic method by which 
oligonucleotides are directly synthesized in high density on the surface of the glass. 
This method is now being practically used 3 ). Otherwise, a method using an ink-jet 
mechanism is under development, and this method is expected to have a possibility 
that can form the array in higher speed with a higher density. 

(2) Practice of micro-array 

An example of practical experiments will be described hereinafter. In our 
laboratory, direct and indirect influences of genes are comprehensively collected by 
allowing some genes in budding yeast to be destroyed or to be excessively expressed to 
investigate expression profiles of the total genes. The experimental procedure of the 
micro-array will be described herein with reference to Figs. 2 and 3. 
(I) Total genes of the microorganism to be studied are prepared at first. Usually, a 
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set of PGR (polymerase chain reaction) primers (with a length of 20 bases) against the 
total genes are prepared, and PGR reactions are applied against the genomes or clones 
to obtained amplified fragments of the total genes. 

(II) DNA of each amplified gene is spotted on a slide glass using a robot called a 
spotter or arrayer, and the genes are fixed. Each spot has a diameter of 150 and 
2,500 genes are spotted per 1 square cm. 

(m) Subsequently, the sample is prepared. In the case of the budding yeast, a 
selected strain after the genome analysis and a mutant strain in which a specified 
gene has been destroyed are cultured under the same condition. mRNAs are 
extracted from both strains. cDNAs are prepared from these mRNAs, and are 
simultaneously labeled with two lands of fluorescent substances (having different 
excitation wavelength and fluorescence wavelength with each other). An equal amount 
of cDNAs prepared from respective strains are mixed. The mixture is placed on the 
micro-array, and cDNAs are bound with the probes fixed on the glass surface by 
complementary bonds. The slide is washed to remove non-specificalry bound cDNAs, 
thereby obtaining the expression profile of about 6,000 total genes in one experiment. 
(IV) The micro-array binding the sample is quantified with a two-wavelength 
fluorometer to calculate the sample to reference ratio. The fluorometer is required to 
have a high sensitivity since the spot size is very small. 
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FIG. 3 Measurement of expression profile using micro-array 

The mutant -cells in the drawing mean the cells treated with a chemical. 
The gene expression profiles are prepared for comparison among the profiles, followed 
by identification of the location of the profile in the network. 
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3. Analysis of gene expression net work by information technology 
(1) Kinds of data 

Data is acquired at one selected time, or the cells are sampled with a given 
time interval during cultivation to acquire the expression profile data at that time as a 
time course. Otherwise, the experiment is carried out at one point on the time axis 
under different experimental conditions in order to collect as many expression profile 
changes as possible. Table 1 shows a part of Gene Expression Matrix in which the 
data of the gene destroy experiments are compiled. The results of gene destroy 
experiments with respect to 6,000 genes are shown in the columns of the table as 
ratios to wild type strains, in which the data of respective destroyed strains are 
accumulated. 
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TABLE 1 Gene Expression MATRIX 
The columns denote the destroyed gene names, and the rows denote the 
gene names. The figures are represented as relative amounts of expression by taking 
the normal level as unity. 
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(2) Analysis model 

Determination of network from array data 

According to the prediction that the functional expression of microorganisms 
is defined by gene expression information based on the central dogma of the molecular 
biology, a line of information for elucidating the functions of organisms whose 
expression network has been investigated are expected to be obtained by analysis of 
the large scale gene expression data. Generally speaking, the methods used involve 
elucidation of gene groups having the same gene expressing pattern from the gene 
expression profiles issued from the project using clustering method*, 8). Apart from 
these methods, studies for presuming the gene expression control network have 
started worldwide. The basic object thereof is to develop calculation methods for 
forming a construction diagram of the gene expression control network as shown in 
Fig. 4 from the gene expression matrix or time course data. 

Guideline for Analysis o f Large Scale Network 

Gene expression matrix 
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of gene expression 



normal conirio** 




drfrtiaa of rctd 




Crcsc yiome 


: 


,1 52 *> — $30 


nanaai cwwt 


1 1 I — 1 


tWtlr %,<rc-c 7 


- 1 1 - O 
1 - I — O 

I r I % 
111"**- 








l\ i x.~- r \|, A-UlX 



Boolean network model \^y^ 
(multi-level digraph, model) [i* 

X 




Pynannc rxjodel^Y 
pstknaie £J) 
S-^system} 

"30 




Values in the same group 



(7V-H2) Positive interaction 
Qj-^Qj Negative interaction 



FIG. 4 Conceptual diagram of identification of gene network 
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Net work analysis using such model is largely divided into two groups. A 
boolean model is used in one group, wherein the relation among the genes is 
represented by a binary state in which one gene is controlled by the other gene or not?> 
1°). Different models deal with actual number of expression, wherein most of the 
models are based on differential equations related to dynamics of continuous values 11 * 
32). However, the model based on the differential equation requires some devices since 
the reaction constants involved in the model should be optimized in the expression 
control network analysis carried out against the total genome 13 ). An analyses method 
using a Bayes network have been introduced in addition to these method 14 ). 

2. Our trial 

(1) Expression profile construction program 

Since genome level studies have became possible today in the environment 
that allows multi-dimensional data to be readily measured, it is considered to be 
necessary to collect many expression profiles of gene destroyed or gene excess 
expression strains for the purpose of elucidation of the gene expression control 
network. We have also constructed many gene destroyed mutants for genes mainly 
. comprising the genes related to transcription, genes related to signal transfer and 
genes related to cell cycles as shown in Table 2, and are trying to collect their 
expression profiles. One hundred and seventy three mutants mainly involving 
transfer factors have been constructed today, and their expression profiles are under 
construction with a program for disclosing the expression profile data this year. 
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TABLE 2 Construction program of gene destroyed strains 



Category- 


Total 


Fatal Gene 


Goal 


Transfer System 


741 


239 


502 


Transfer Factor 


538 


142 


396 


Signal Transfer System 


126 


21 


105 


Cell Proliferation, Budding 


784 


190 


594 


and DNA Synthesis 









. (2) Trial for identifying the network 

The multi-level digraph method we are trying will be introduced below. 
While the network is determined by the method shown in Fig. 5, it involves the 
following steps. (1) A gene expression matrix GE is constructed from the gene 
expression profile data collected; (2) The effect of one gene on any of the other genes, if 
any, is inquired by an operation to assign 1 for the gene in which its expression ratio 
has been remarkably changed, and 0 for the others, thereby forming a binary matrix. 
(3) The parts in which this matrix forms a cycle is searched, and the parts are collected 
and defines as a new gene group B'. Since handling of the looped control structure is 
complex, trie parts are collected together to define as a new gene group again. (4) The 
columns and lows of the matrix are replaced to align related genes at an upper right 
portion of the matrix This makes most of network topologies of the genes that are 
controlled with each other to be clear. (5) Finally, the gene expression network may 
be further defined by removing the indirectly connected parts. The network 
determined by this method from the expression profile of 70 gene destroyed strains we 
have constructed is shown in Fig. 6. 
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Multi -level digraph approach 

Presumption of expression control network from expression profile 



<T) Gene expression matrix Ge 



@ Binomial relation As <3) Binary adjoining matrix A 



Gene 


Gl 


G2 


G3 


G4 


. G4 affects G3 
G3 affects G4 
G2 affects G3 


"\To 


Gl G2 G3 G4 


W3d type 


1 


1 


1 


1 


0 1 


1 1 


G4 dbroptant 


2.3 


3.5 


0.1 




P> G2 affects G4 


EZ3> G2 


0 0 


I 1 


G3 tfisroptant 


4.1 


2.3 




0.2 


Gl affects G2 


G3 


0 0 


0 1 1 




3.0 




0.3' 


0.3 


Gl affects G3 


G4 


0 0 


1 0! 


<3l (fisrupCsnt 




0.4 


0.3 


0.2 


Gl affects G4 









Induction of control relation among genes from 
trie gene expression matrix 
(6) Multi-level digraph © Skeleton matrix R 



Gl 
G2 
{G3!g4} 





Gl G2 G3 f 


Gl 


0 1 0 


G2 


0 0 1 


G3' 


0 0 0 



Gene group having a loop structure 



is treated as an equivalence group 



2\To 
From — ^ 


Gl G2 G3' 


Gl 


0 11 


G2 


0 0 1 


G3* 


0 0 0 



GMG3.G4} 



FIG. 5 Outline of multi-level digraph method 
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FIG. 6 A part of network determined from 70 destroyed strains 

4. Problems in the future 

The most largest problem is errors involved in the data. Since the currently 
available experimental technologies may give rise to about errors in the order of from 
30% to 40%, developments of infbrmatex technologies is necessary based on such 
errors. However, the method for dealing with such large errors is not currently 
sufficient. Therefore, methodologies capable of experimentally reducing the errors, or 
methodology for estimating the errors to a certain extent are expected to be 
introduced. 

Developments of display methods of the results are important for treating a 
vast number of genes. Since the network among 6,O0O genes should be displayed for 
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transcription factors were selected for analysis. We constructed an initial 
552 gene-member model of regulatory control relationships and then 
constructed a sub-network model composed of the 98 known transcription 
factors. In creating these models we employed a Boolean computational 
algorithm as reported in Shoudan, et. Ah x . An outline of our method is 
shown in fig.7. 



Multi-level digraph approach 

Reconstruct the network classified into the genes and some equivalence sets 
using a gene expression matrix. 

) gene expression matrix G E (2) binary matrix Ab (D adjacency matrix A 

G4 affects G3 
G3 affects G4 
G2 affects <33 
G2 affects G4 
Gl affects G2 
GJ affects G3 
Gl affects G4 
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Wild type 


1111 
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110- 
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1 I - 0 


Gl cSLsrvptam 


I - 0 0 




- 0 0 0 



Gl G2 G3 G4 


Gl 


0 1 


1 1 


G2 


0 0 


1 1 


G3 


0 0 


0 1 


G4 


0 0 


I 0 



Derive binary relations between genes nosing a gene expression matrix, 

©rnnltMeveiaignyh © skeleton relation R ©Partition genes ^o eqmvalrace 

sets (belonging to closed path} 



Gl G2 03' 


Gl 


0 1 0 


G2 


0 0 1 


G3* 


0 0 0 



Gl 

+ | Gl G2G3' Gl G2G3' 



G2 <#S Gl 0 10 Gl Oil 

+ l G2 0 0 1 G2 0 0 1 

(G3G4) GV °°° g ° ° ° 

GMG3,G4} 

find ont a partial order between equivalence sets and skeleton network, and define 
Boolean functions between genes; 



Figure 7 

Figure 7. Construction of a gene regulatory subnetwork model with 
the Boolean method. First, a gene expression matrix is created from a set of 
gene disruption experiments. Each element of the matrix indicates the ratio 
of gene expression and the binary relation matrix shows that if gene 'a* is 
deleted and intensity of gene Tb' is largely changed, gene 'a* affects gene T>\ If 
gene a and gene V affect each other mutually, they form a loop (strongly 
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connected component). An equivalence set is introduced which treats a loop 
logically as a gene. An equivalence set is a group of genes that affect each 
other or only one gene. Next genes are partitioned into equivalence sets. 
These semi-ordered accessibility matrices among equivalence sets are 
describe influence as an equivalence set group even when individual genes 
with the set do not directly influence each other. Connections that are also 
included in lower rows are * then removed to build the hierarchical 
connections in the network and generating the network topology of the gene 
in control relationships. 

The network model constructed by Boolean models of the expression 
experiments resulted in a well-defined regulatory transcription factor 
sub-network involved in mitosis and meiosis pathway. Fig. 8 shows this 
transcription factor sub-network in detail. As can be seen in this figure, the 
sub-network describes specific regulatory relationships among transcription 
factors involved in mating, meiosis and DNA structure and repair 
mechanisms. For example HAP2 y which is a gene whose function has yet to 
be reported shows indirect regulatory influence on Alpha2 7 a known mating 
response gene via an intermediary product, THI2 whose specific function is 
likewise unknown. Furthermore, various elements related -to chromosome 
structure and DNA repair thought to be unconnected to known meiosis 
pathways are influenced by MBT28 and CIN5. Both of these results make t 
sense biologically; the dependence on error-free DNA replication for 
successful sexual reproduction and the relationship between meiosis and 
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mating are obvious. 

To understand the generalized regulatory relationships among 
various functional groups of transcription factors, we classified transcription 
factors in the network model with their ^cellular roles". A "cellular role" here 
is defined as the major biological process involving the protein in YPD. We 
defined 10 groups for categorization in our transcription factor network. Fig 
8a displays graphically the interrelationships of classified transcription 
factors in our network, and Fig 8b. shows the numerical relationships among 
the groups. 

As seen in Figure 8a, the cell cycle transcription factors were influenced 
by only a two of the 10 categories of transcription factors and only three 
genes in total. The three elements that independently influenced cell cycle 
transcription factors were related to meiosis, mating response, and cell stress, 
processes that for functional reasons would necessitate limited interaction 
with the cell cycle process. It is known that eukaryotic organisms use 
discreet and highly conserved cell-cycle checkpoints to ensure that nuclear 
division is restrained while DNA is undergoing replication or repair (3^-37)^ 

Likewise, it is apparent that the lipid fatty-acid metabolism transcription 
factors were affected by the carbohydrate metabolism transcription factors. 
Our data demonstrates, for example, that expression of IN04 y a UDFHGDHG, 
is influenced by RTG1, a JJHGDGGH. Also, PDR1, a BIG HORSE is 
influenced by the" expression of CAT8, a SMALL DOG. Other such 
interactions between proteins involved in phospholipid synthesis pathway 
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with genes of the glucose response pathway, the lipid signaling pathway and 
other lipid synthesis pathways have been well described previously an the 
literature* 26 ). It is well known that the genes encoding many enzymes of 
phospholipid biosynthesis all contain variants of UASINO in their- promoters 
and these are regulated in response to growth phase and nutrient starvation 
(G.M. Carman (1991)). Especially, Snflp/Snf4p and Irelp are reported that 
the key molecules that bind to UASINO sequence, which are associated with 
the glucose response, are required for derepression of UASino- containing 
genes. 

To confirm validity of our models on a genome-wide basis, we 
searched for the existence of previously described regulatory relationships 
from the literature within our 552 gene-member regulatory model. Of 98 
known transcription factors, the regulatory networks upstream and 
downstream of 26 of these can be elucidated from the interactions reported 
in the literature. We examined how many of these pathways for well-known 
genes were reconstructed in our model. As a result, we found that 
approximately 27% of these relations were apparent from our expression 
data. 

Our gene regulatory network construction using gene disruptant 
based expression profile data along with Boolean computational models has 
elucidated a number of gene regulatory relationships that make specific 
predictions relevant to pathway biology. Further tuning of these models 
from new biological insights and correlation of data with other expression 
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data such as protein expression experiments will allow for increased 
resolution of specific pathway interactions. In addition, we are currently 
investigating other computational methodologies for examination of 
stochastic relationships contained in expression profile data such as 
Bayesian Modeling. Use of this and similar strategies for the elucidation of 
dynamic regulatory pathways from static expression data will allow for a new 
era of cost-effective, high throughput research into novel biological pathways 
and mechanisms. 

Figure 8. Gene regulatory networks by cellular function. Figure 8a. 
A regulatory subnetwork of 98 transcription factors grouped by cellular 
function according to information provided in the YPD database. Genes 
inside the circles are those grouped into a given cellular functional category. 
Lines of hierarchical relationship in regulatory control are depicted with 
colored lines with the lines the same color as the category of genes from 
which the control relationship emanate toward the genes on which they exert 
regulatory influence. The bold red lines indicate regulatory control of 
transcription factors related to cell cycle originating from genes in the 
meiosis/mating response group. The dark blue lines indicate control 

relationships eminating from the carbohydrate metabolism group exerting 

* 

influence over genes in the lipid fatty-acid metabolism group. This result is 
supported by the literature and also common sense. This well- con served 
gene regulatory relationship insures that you will get fat if you eat too many 
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sweets, b. Frequency of control relationships among functional categories of 
transcription factors. Row headings indicate gene control responses 
initiating from the category and column headings represent the categories of 
genes controlled. Numbers indicate the number of discreet control 
relationships initiated from a gene in one category and exerting control over 
a gene in the other category. The bold red numbers highlight the relatively 
sparse control relationships exerting influence over the cell cycle 
transcription factors. Bold blue numbers highlight the relatively sparse 
control relationships exerting control over lipid fatty acid metabolism, both of 
which emanate from the carbohydrate metabolism category. 

Figure 9. A detailed view of a gene regulatory subnetwork of 
transcription factors reconstructed from Boolean analysis of gene expression 
experiments on disrupt ant mutant yeast strains. Black lines indicate a 
regulatory relationship, with the arrows showing the hierarchical direction of 
the expression influence. The colors and shapes of the nodes denote the 
general categories of cellular function of the gene product according to their 
descriptions in the YPD database. Genes related to cell division 
mechanisms are indicated with triangular nodes and genes related to DNA 
repair and chromosome structure are depicted with squares. The elucidated 
network shows novel topological control relationships among genes related to 
meiosis, the mating response and DNA structure and repair mechanisms. 
Genes related to meiosis and mating response genes are downstream of a 
cascade regulated by IN02 and a further sub-grouping of genes related to 
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DNA repair and structure appears hierarchically downstream of UME6. 

Figure 10 depicts a schematic diagram of a system for discovering 
networks of this invention. Cells (top line: a A n ) are grown and treated with 
and without interventions to produce mRNA reflecting the treatments of lack 
thereof. Messenger RNA is reverse transcribed into cDNA and spotted onto 
arrays (second line from top). Information regarding the levels of expression 
is collected and stored in the Database. Information from the Database is 
analyzed according to the network modeling of this invention to produce a 
network of relationships between the different genes (in box). The output 
from the system is then displayed on a monitor or other visualization tool. 

The above approach in determining networks can be applied to a variety of 
linear and non-linear analysis methods, including, but not limited to Boolean analysis, 
Bayesian analysis, graphical modeling, clustering analysis, maximum likelihood 
estimation, maximum penalized likelihood estimation and other types of analytical 
methods. 
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INDUSTRIAL APPLICABILITY 

Methods for determining network relationships between genes is useful 
in the development of lead compounds in the pharmaceutical, health care and 
5 other industries in which relationships between genes is desired. Inferential 

methods of this invention find application in any area of statistical analysis in 
which complex relationships between groups of data are desired. Such areas 
include engineering, economics and biology. 
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We claim: 

1 . A method for constructing a gene network, comprising the steps of: 

(a) providing a quantitative disruptant data library for a set of genes 
of an organism, said library including expression results based on 

5 disruption of each gene in said set of genes, quantifying an 

average effect and a measure of variability of each disruption on 
each other of said genes; 

(b) creating a gene expression matrix from said library; 

(c) generating network relationships between said genes; and 
10 (d) determining if one or more groups of genes is expressed 

differently from other of said groups of genes. 

2. The method of claim 1, further comprising the step of: 

(e) providing a Bayesian computational model, wherein said 
1 5 Bayesian model comprises minimizing a BNRC criterion. 

3 . The method of claim 2, wherein said step of minimizing a BNRC 
criterion comprises using a non-linear curve fitting method selected 
from the group consisting of polynomial bases, Fourier series, wavelet 

20 bases, regression spline bases and B-splines. 

4. The method of claim 1, wherein said data library is created using a drug 
to alter gene expression. 

25 5. The method of claim 2, wherein said step of minimizing said BNRC 

criterion further comprises selecting a Bayesian mode using a 
backfitting algorithm. 

6. The method of claim 2, wherein said step of minimizing a BNRC 
30 criterion further comprises using Akaike's information criterion. 

7. The method of claim 2, wherein said step of minimizing a BNRC 
criterion further comprising using maximum likelihood estimation. 
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8. The method of claim 1, wherein said genes are associated with a cell 
cycle. 

9. The method of claim 2, wherein said measure of variability is variance. 

5 

10. The method of claim 3, wherein said non-linear curve fitting method is a 
non-parametric method. 

1 1 . The method of claim 1 0, wherein said non-parametric method for 
10 minimizing a BNRC criterion includes using heterogeneous error 

variances. 

12. The method of claim 11, wherein said step of minimizing a BNRC 
criterion further comprises the steps of: 

15 (1) making a score matrix whose (/, j) ih element is the BNRC^ ctero 

score of the graph gene* -» gene/; 

(2) implementing one or more of add, remove and reverse which 
provides the smallest BNRCheterol and 

(3) repeating step 2 until the BNRCa e , era does not reduce further. 

20 

13. The method of claim 11, wherein said step of minimizing a BNRC 
criterion further comprises the step of applying a hill-climbing algorithm 
to minimize BNRC® hetero- 

25 14. The method of claim 1 1 , wherein an intensity of the edge is determined 

using a bootstrap method. 

15. The method of claim 14, wherein said bootstrap method comprises the 
steps of: 

30 (1) providing a bootstrap gene expression matrix by randomly 

sampling a number of times, with replacement, from the original 
gene library expression data; 
(2) estimating the genetic network for gene, and gene,-; 
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(3) repeating steps (1) and (2) T times, thereby producing T genetic 
networks; and 

(4) calculating the bootstrap edge intensity between gene* and genev- 
as (ti 4- t 2 )/T. 

5 

16. A method for elucidating a gene network, comprising the steps of: 

(a) providing a raw data library of time-course gene expression data 
for a plurality of genes of an organism; 

(b) subtracting background signal intensities from said raw data 
10 library; 

(c) calculating the relative change in gene expression for each of 
said plurality of genes; 

(d) analyzing the statistical significance of said relative in gene 
expression using Student's t-test; and 

1 5 (e) fitting said changes in gene expression to a linear spline function. 

17. The method of claim 16, further comprising the step of removing from 
consideration, those genes whose expression levels are sufficiently low 
so as to be determined predominantly by noise. 

20 

18. The method of claim 1, wherein said step of grouping comprises 
grouping said genes into one or more equivalence sets. 

19. A method for estimating a gene network relationship and optimizing 
25 hyper parameters of said relationship, comprising the steps of: 

(1) fixing hyper parameter p/ 9 

(2) initializing y jk = 0 , k - 1 , . . ., gj; 

(3) finding the optimal $ jk by repeating steps 3-1 and 3-2; 
(3-1) computing: 

30 Yjk = (B T Jk WjkBjk + nfijkKjO ' 2 B T Jk W Jk x (x© - 2 B Jk >% k >), 

for fixed Py*; 

(3-2) repeating step 3-1 against a candidate value of $ Jk and 
choosing the optimal value of P/*. which minimizes BNRChetero 
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10 



15 



(4) repeat step 3 for k= 1, . . . , qf, 1, . . - , qf, 1, . . . until a suitable 
convergence criterion is satisfied; and 

(5) repeat steps 1 to Step 4 against a candidate value of pj 9 and 
choosing the optimal value of pj, which minimizes the BNRChetero- 

20. A method for constructing a gene network model of a system containing 
a network of genes comprising using a Bayesian computational model, 
wherein said Bayesian computational model comprises minimizing a 
BNRC criterion. 

21 . The method of claim 20, wherein minimizing the BNRC criterion 
comprises using a non-linear curve fitting method selected from the 
group consisting of polynomial bases, Fourier series, wavelet bases, 
regression spline bases and B-splines. 

22. The method of claim 20, wherein minimizing the BNRC criterion 
comprises selecting a Bayesian mode using a backfitting algorithm. 

23. The method of claim 20, wherein minimizing the BNRC criterion 
20 comprises using Akaike' s information criterion. 

24. The method of claim 20, wherein minimizing the BNRC criterion 
comprises using maximum likelihood estimation. 

25 25. The method of claim 20, wherein minimizing the BNRC criterion 

comprises using a non-linear curve fitting method, wherein the non- 
linear curve fitting method is a non-parametric method. 

26. The method of claim 25, wherein the non-parametric method includes 
30 using heterogeneous error variances. 

27. The method of claim 26, wherein minimizing the BNRC criterion 
further comprises the steps of: 
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(1) making a score matrix whose (i,j) th element is the BNRC^ew 
score of the graph gene, — > gene,-; 

(2) implementing one or more of add, remove and reverse which 
provides the smallest BNRC/ 2eZero ; and 

(3) repeating step 2 until the BNRChetero does not reduce further. 

28. The method of claim 26, wherein minimizing the BNRC criterion 
further comprises the step of applying a hill-climbing algorithm to 
minimize BNRC^ e / cro . 

29. The method of claim 26, wherein an intensity of the edge is determined 
using a bootstrap method. 

30. The method of claim 29, wherein said bootstrap method comprises the 
15 steps of: 

(1) providing a bootstrap gene expression matrix by randomly 
sampling a number of times, with replacement, from the original 
gene library expression data; 

(2) estimating the genetic network for gene z and gene/; 

20 (3) repeating steps (1 ) and (2) T times, thereby producing T genetic 

networks; and 

(4) calculating the bootstrap edge intensity between gene, and gene/ 
as (ti + t 2 )/T. 

25 31. The method of claim 20, wherein the Bayesian computational model is 

used to analyze a gene expression profile of the system. 

32. The method of claim 31, wherein the gene expression profile comprises 
the level of gene expression of each gene in the system. 



30 



33. The method of claim 32, wherein at least one gene in the system is 
disrupted. 
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34. The method of claim 32, wherein the gene expression profile comprises 
a sub-gene expression profile and wherein the sub-gene expression 
profile comprises the level of gene expression of each gene in the system 
when at least one gene is disrupted in the system. 

35. The method of claim 34, wherein the gene expression profile comprises 
at least two different sub-gene expression profiles. 

36. The method of claim 32, wherein the system is treated with an agent. 

37. A method for constructing a gene network model of a system containing 
a network of genes comprising using a Bayesian computational model 
and a Boolean method. 

15 38. The method of claim 37, wherein said Bayesian computational model 

comprises minimizing aBNRC criterion. 

39. The method of claim 37, wherein the Bayesian computational model and 
the Boolean method are used to analyze a gene expression profile of the 

20 system. 

40. The method of claim 39, wherein the gene expression profile comprises 
the level of gene expression of each gene in the system. 

25 41 . The method of claim 40, wherein at least one gene in the system is 

disrupted. 

42. The method of claim 40, wherein the gene expression profile comprises 
a sub-gene expression profile and wherein the sub-gene expression 

30 profile comprises the level of gene expression of each gene in the system 

when at least one gene is disrupted in the system. 

43. The method of claim 42, wherein the gene expression profile comprises 
at least two different sub-gene expression profiles. 
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44. The method of claim 40, wherein the system is treated with an agent. 

45. A data file comprising a gene network model constructed by the method 
of claim 20. 

46. The data file of claim 45 in a computer readable form. 

47. The data file of claim 45 accessible from a remote location. 

48. The data file of claim 45 accessible from an internet web location. 



49. A method for identifying a target gene of an agent in a system 
containing a gene network comprising: 
15 (a) constructing a first and second gene network model using a 

Bayesian computational model, 

wherein said Bayesian computational model comprises 
minimizing a BNRC criterion, 

20 wherein the first gene network model is obtained by 

analyzing a first gene expression profile and the second 
gene network model is obtained by analyzing a second 
gene expression profile, 

25 wherein the first gene expression profile is obtained from . 

the system without being treated with the agent and the 
second gene expression profile is obtained from the 
system being treated with the agent, and 

30 (b) analyzing the first and second gene network model using said 

Bayesian computational model, wherein the agent is considered 
as a gene in the system, and wherein a target gene of the agent is 
identified. 
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50. The method of claim 49, wherein the target gene is a gene directly 
affected by the agent. 

51. The method of claim 49, wherein the target gene is a gene indirectly 
affected by the agent. 

52. A data file containing the identity of one or more target genes of the 
agent obtained according to the method of claim 49. 

53. The data file of claim 52 in a computer readable form. 

54. The data file of claim 52 accessible from a remote location. 

55. The data file of claim 52 accessible from an internet web location. 

56. A method of providing a service comprising receiving an agent from a 
party, and identifying a target gene of the agent for the party according 
to the method of claim 49. 

57. The method of claim 56, wherein receiving an agent comprises receiving 
the identity of the agent. 

58. A method of providing a service comprising receiving an agent from a 
party, and identifying a target gene of the agent for the party using the 
gene network model constructed according to the method of claim 20. 
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Fig5 
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Fig 6 



Yes 



Input data file 



Make the BNRC score matrix, where (i,j) element represents the score when gene j is 
the parent gene of gene i. 



Construct the set of candidate parent genes of each gene from the BNRC score matrix. 
The maximum number of candidate parent genes is given by the user by -p option. 



Set r=l. 

r is the number of permuted sequences given by the user by -r number option. 



Permute the computational order of genes 



Seti=l 

i is the number of purmuted gene. 



For each gene j (j is not equal to i), carry out one of the three procedures for the edge: 
1) add the edge j -> i, 2)remove the edge j -> i (if gene j is the parent gene of gene i until 
the previous step), 3)reverse (if gene j is the child gene of gene i until the previous step), 
which gives the smallest BNRC score. 




6/7 



WO 03/027262 



PCT/US02/31093 



Fig. 7a 

AAH1 ,AAT1 ,ADE1 3,ADE3,AGA2,ALD4 > ALPHA2,APC1 ,APG1 3,APM3 ) APT2,AR03,AR09,ASK1 0.ATP7.AT 

P8 I ATP9,BCS1,CBF5,CDC15,CDC37,CDC50,CEM1,CHD1,CIT1,CLN2,CMK2,CNB1,COS6,COX17,CPA1, 

CPR7,CTT1,CWP2 > CYP2,DAL3,DBF2,DBP2,DBP3,DLD1,DNA2,DPH2,ECM29,ECM4,ENP1,FAS1,FIG2,F 

OL1 ,FRE1 ,GAL2,GAL7,GAT1 ,GBP2,GCD1 0.GCY1 ,G!C1 ,GLK1 ,GTT1 ,GTT2,GUA1 ,HAS1,HEM1 2.HGH1 ,HI 

S4.HKR1 ,HMG1,HMS2,HSP1 2.HXK1 ,HXT17,HXT6,HYM1 ,IMG1 ,JA2,JEN1 ) KEL3,KIN28,LCB1 ,LYS2,LYS4, 

MAL13,MAI^3,MBP1,MIS1,MLP1,MRP4,MRPL36,MRT4,MSS4,NGG1,NHP2,NOG1 l NOP1,NOP2,NOP5,N 

SR1 ,NTG2,NUP1 1 6,OM45,ORF_Q0144,OSH1 ,OST3,PGM2,PHB2,POM1 52.PPT1 ) PTC4,PUS2,PUT2,PXA 

2,PYC1 > PYK2,RGA2,RPA135,RPA14,RPA49,RPC40 ) RPL15B,RPL16B,RPL19B,RPL3,RPL31B,RRN3,RTT 

1 03.SAC2.SAM1 ,SAM3,SCC2,SCW4,SDS23,SDS24,SEC1 1 ,SEC63,SEN1 ,SK01 ,SMM1 ,SNF5,SOL1 ,SOL4 

> SPI1 ( SRP40,STE5,SWI5 ) SXM1 > TAF61,TBS1,TFS1,THI12,TID3,TIF3,TOM7,TPS2,TRM1,UBP11 1 UGA1 > UR 

A7,VAR1,XBP1,YAL028W,YAL036C,YAL060W,YAP6,YAR009C,YBR204C,YBR238C,YBR271W,YCL036W 

,YCL042W,YCL056C,YCR044C,YCR062W,YCR086W,YCR087W,YDL034W,YDL063C,YDL085W,YDL139 

C,YDL204W,YDL214C,YDL222C,YDL223C,YDL239C,YDR026C,YDR070C,YDR091C,YDR101C,YDR134C 

,YDR153C,YDR186C,YDR233C,YDR326C,YDR332W,YDR361C,YDR366C,YDR380W,YDR384C,YDR387 

C > YDR479C,YDR493W,YDR504C,YDR516C,YER049W,YER067W,YER084W,YER175C,YER182W,YFR0 

12W,YGL037C,YGR001C,YGR064W,YGR086C,YGR101W,YGR160W,YGR196C,YGR200C,YGR228W,Y 

GR235C,YGR245C,YGR271W,YHB1,YHL021C,YHL026C,YHL037C,YHM2,YHR020W,YHR059W,YHR087 

W I YHR097C,YHR155W,YHR186C,YIL044C,YIL124W,YIL149C,YIL169C,YIR036C,YJL021C,YJL066C,YJL 

068C,YJL086C,YJL161W,YJL200C,YJL211C,YJR041C,YJR070C,YJR071W,YJR120W,YJR157W,YKL014 

C,YKL026C,YKL036C,YKL1 1 1 C.YKL1 79C,YKR088C,YKR096W,YLA1 ,YLL064C,YLR002C,YLR033W,YLR0 

73C,YLR177W,YLR270W > YLR327C,YMC2 > YML047C,YML128C,YMR090W,YMR131C,YMR250W,YNL058 

C,YNL095C > YNL114C,YNL132W,YNL156C,YNL170W,YNL175C,YNL195C,YNL203C,YNL305C,YNR037C, 

YNR059W,YOL027C > YOL071W,YOL082W,YOL084W,YOL106W,YOL155C,YOR004W,YOR093C,YOR121 

C,YOR145C > YOR146W > YOR154W > YOR155C,YOR173W,YOR289W,YOR309C,YOR325W,YPL134C ) YPR1 

18W,YTM1,YTP1,ZPR1 



Fig. 7b 
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